Introductions and Tutorials With DirectX 9
Edited by
Wolfgang F. Engel
Library of Congress Cataloging-in-Publication Data ShaderX2 : introductions and tutorials with DirectX 9 / edited by Wolfgang Engel. p. cm. Includes bibliographical references and index. ISBN 1-55622-902-X (paperback, companion CD-ROM) 1. Computer games--Programming. 2. Three-dimensional display systems. 3. DirectX. I. Engel, Wolfgang F. QA76.76.C672S47 2003 794.8'16693--dc22 2003016311 CIP
© 2004, Wordware Publishing, Inc.
All Rights Reserved

2320 Los Rios Boulevard
Plano, Texas 75074

No part of this book may be reproduced in any form or by any means without permission in writing from Wordware Publishing, Inc.

Printed in the United States of America
ISBN 1-55622-902-X
10 9 8 7 6 5 4 3 2 1
0307
Crystal Reports is a registered trademark of Crystal Decisions, Inc. in the United States and/or other countries. Names of Crystal Decisions products referenced herein are trademarks or registered trademarks of Crystal Decisions or its

Screen shots used in this book remain the property of their respective companies. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks should not be regarded as intent to infringe on the property of others. The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products.

This book is sold as is, without warranty of any kind, either express or implied, respecting the contents of this book and any disks or programs that may accompany it, including but not limited to implied warranties for the book's quality, performance, merchantability, or fitness for any particular purpose. Neither Wordware Publishing, Inc. nor its dealers or distributors shall be liable to the purchaser or any other person or entity with respect to any liability, loss, or damage caused or alleged to have been caused directly or indirectly by this book.
All inquiries for volume purchases of this book should be addressed to Wordware Publishing, Inc., at the above address. Telephone inquiries may be made by calling:
(972) 423-0090
Articles
Introduction to the DirectX High Level Shading Language
    Craig Peeper and Jason L. Mitchell
Introduction to the vs_3_0 and ps_3_0 Shader Models
    Nicolas Thibieroz, Kristof Beets, and Aaron Burton
Advanced Lighting and Shading with Direct3D 9
    Michal Valient
Introduction to Different Fog Effects
    Markus Nuebel
Shadow Mapping with Direct3D 9
    Michal Valient
The Theory of Stencil Shadow Volumes
    Hun Yen Kwoon
Shader Development Using RenderMonkey
    Natalya Tatarchuk
Tips for Creating Shader-Friendly 3D Models
    Gim Guan Chua
Contents
Preface
About the Authors
Introduction

Introduction to the DirectX High Level Shading Language
Craig Peeper and Jason L. Mitchell
    Introduction
    A Simple Example
    Assembly Language and Compile Targets
    Hardware Realities
    Compilation Failure
    The Command-line Compiler fxc
    Language Basics
    Keywords
    Data Types
    Type Modifiers
    Storage Class Modifiers
    Initializers
    Working with Vectors
    Constructors
    Type Casting
    Structures
    Samplers
    Intrinsics
    Math Intrinsics
    Texture Sampling Intrinsics
    Shader Inputs
    Uniform Input
    Varying Input
    Shader Outputs
    An Example Shader
    Optimization
    Matrix Data Type Usage
    Integer Data Type Usage
    Flow Control and Performance
    Importance of Input Type Declarations
    Precision Issues (logp, expp, lit)
    Using the ps_1_x Compile Targets
    Strategy for Targeting ps_1_x
    Integration into an Engine Using D3DX Effects
    Effect Files
    The Effect API
    Integration into an Engine without Using D3DX Effects
    The Constant Table
    SDK Updates
    Conclusion
    Acknowledgments

Introduction to the vs_3_0 and ps_3_0 Shader Models
Nicolas Thibieroz, Kristof Beets, and Aaron Burton
    Introduction
    Features Common to vs_3_0 and ps_3_0
    Flexible Input and Output Declarations
    Predication
    Static and Dynamic Flow Control
    Arbitrary Swizzle
    Destination Write Masks on Texture Instructions
    vs_3_0 Features
    Registers
    Instructions
    Texture Sampling
    Vertex Stream Frequency
    ps_3_0 Features
    Registers
    Instructions
    Unlimited Texture Samples and Dependent Reads
    Conclusion
    References

Advanced Lighting and Shading with Direct3D 9
Michal Valient
    Introduction
    Per-Pixel Phong
    Phong's Lighting Equation
    Vertex and Pixel Shaders 2.0
    Vertex and Pixel Shaders 3.0
    Per-pixel Environment Bump Mapping with Fresnel Term
    Mathematical Background
    Vertex Shader
    Pixel Shader 1.4
    Pixel Shader 2.0
    HLSL Version
    Background for Advanced Models
    Spherical Coordinates
    Roughness of a Surface
    Masking and Shadowing
    The Oren-Nayar Model
    Shaders
    HLSL Version
    Cook-Torrance Model
    Shaders 2.0
    Shaders 1.4
    HLSL Version
    Quality Comparison
    Conclusion
    References

Introduction to Different Fog Effects
Markus Nuebel
    Introduction
    The Theory behind Fog Calculations
    Technique One: Linear Fog
    Fog Equation
    Implementation
    Technique Two: Exponential Fog
    Fog Equation
    Implementation
    Technique Three: Exponential Squared Fog
    Fog Equation
    Implementation
    Technique Four: Layered Fog
    Theory and Equations
    Implementation
    Technique Five: Animated Fog
    Theory and Equations
    Implementation
    Conclusion
    References

Shadow Mapping with Direct3D 9
Michal Valient
    Introduction
    Shadow Algorithm
    Depth Bias Problem
    Shadow Map Filtering
    Shaders for Shadow Map Creation
    Shaders for Final Rendering
    Conclusion
    References

The Theory of Stencil Shadow Volumes
Hun Yen Kwoon
    Introduction
    Shadow Volume Concept
    Depth-pass (z-pass)
    Depth-fail (z-fail)
    Problems and Solutions
    Finite Shadow Cover
    Ghost Shadow
    View Frustum Clipping
    Implementation on CPU
    How It Is Done
    Silhouette Determination
    Forming the Shadow Volume
    Shadow Volume Capping
    Depth-pass Stenciling Operations (DepthPassCPU)
    Depth-fail Stenciling Operations (DepthFailCPU)
    Rendering Shadow Volume Capping
    Implementation on GPU (Shaders)
    How It Is Done
    Preprocessing of Data
    Forming Shadow Volume in Shaders
    Vertex Shader Implementation (FiniteGPU)
    Vertex Shader Implementation (InfiniteGPU)
    Better with Shaders?
    DirectX 9 HLSL Samples
    Efficiency and Robustness
    Use Less for More
    Cheat Whenever You Can
    Fighting the Invisible
    Scene Management Inside and Out
    Always a Good Switch
    Mix and Match
    The End
    References

Shader Development Using RenderMonkey
Natalya Tatarchuk
    Introduction
    Overview of the IDE
    Creation of Basic Illumination Effect
    Run-Time Database Overview
    Workspace View
    Variable Creation and Management
    Predefined RenderMonkey Variables
    Stream Mapping Module
    Model Management
    Managing Effects
    Pixel and Vertex Shaders
    Editing Shaders
    Vertex Shader Setup and Editing
    Compiling Your Shaders
    Output Window
    Shader Assembly or Compilation Errors
    Editing Assembly
    Pixel Shader Setup and Editing
    Preview Window
    Editing Variables
    Render State Block Management
    Texturing in RenderMonkey
    Texture Objects
    Using Textures with HLSL Shaders
    Rendering to a Texture
    Render Passes
    Renderable Texture Support
    Editing a Renderable Texture
    Editing a Render Target
    Artist Editor
    Editing Variables in the Artist Editor Module
    Summary

Tips for Creating Shader-Friendly 3D Models
Gim Guan Chua
    Generating Suitable Texture Coordinates
    The Influence of Vertex Weight
    Problems with Non-Convex Surfaces
    Conclusion

Index
Preface
After the tremendous success of Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, I planned to do another book with an entirely new set of innovative ideas, techniques, and algorithms. The call for authors led to many proposals from nearly 80 people who wanted to contribute to the book. Some of these proposals featured introductory material and others featured much more advanced themes. Because of the large amount of material, I decided to split the articles into introductory pieces that are much longer but explain a lot of groundwork and articles that assume a certain degree of knowledge. This idea led to two books:

    ShaderX2: Introductions & Tutorials with DirectX 9
    ShaderX2: Shader Programming Tips & Tricks with DirectX 9

The first book (this one) helps the reader get started with shader programming, whereas the second book features tips and tricks that an experienced shader programmer will benefit from.

As with Direct3D ShaderX, Javier Izquierdo Villagrán ([email protected]) prepared the drafts for the cover design of both books with in-game screen shots from Aquanox 2, which were contributed by Ingo Frick, the technical director of Massive Development.

A number of people have enthusiastically contributed to both books:

    Wessam Bahnassi
    Andre Chen
    Muhammad Haggag
    Kenneth L. Hurley
    Eran Kampf
    Brian Peltonen
    Mark Wang

Additionally, the following ShaderX2 authors proofread several articles each:

    Dean Calver
    Nicolas Capens
    Tom Forsyth
    Shawn Hargreaves
    Jeffrey Kiel
    Hun Yen Kwoon
    Markus Nuebel
    Michal Valient
    Oliver Weichhold

These great people spent a lot of time proofreading articles, proposing improvements, and exchanging e-mails with other authors and myself. Their support was essential to the book development process, and their work led to the high quality of the books. Thank you!

Another big thank you goes to the people in the Microsoft Direct3D discussion group (http://DISCUSS.MICROSOFT.COM/archives/DIRECTXDEV.html). They were very helpful in answering my numerous questions.

As with Direct3D ShaderX, there were some driving spirits who encouraged me to start this project and hold on through the seven months it took to complete it:

    Dean Calver (Eclipse)
    Jason L. Mitchell (ATI Research)
    Natasha Tatarchuk (ATI Research)
    Nicolas Thibieroz (PowerVR)
    Carsten Wenzel (Crytek)

Additionally, I have to thank Thomas Rued from DigitalArts for inviting me to the Vision Days in Copenhagen, Denmark, and for the great time I had there. I would like to thank Matthias Wloka and Randima Fernando from nVidia for lunch at GDC 2003. I had a great time.

As usual, the great team at Wordware made the whole project happen: Jim Hill, Wes Beckwith, Heather Hill, Beth Kohler, and Paula Price took over after I sent them hundreds of megabytes of data.

There were numerous other people involved in this book project whom I have not mentioned. I would like to thank them here. It was a pleasure working with so many talented people.

Special thanks go to my wife, Katja, and our daughter, Anna, who spent a lot of evenings and weekends during the last seven months without me, and to my parents, who always helped me to believe in my strength.

Wolfgang F. Engel

P.S.: Plans for an upcoming project named ShaderX3 are already in progress. Any comments, proposals, and suggestions are highly welcome ([email protected]).
software architecture to add programmable behaviors (and properties) to generic 3D objects, and lets them exist without a 2D window frame. Creator Gim Guan Chua is a freelance graphics programmer based in Singapore. He has been developing 3D applications for more than six years and likes to dabble in 3D modeling in his spare time. His web site is http://toybox.150m.com.

Wolfgang F. Engel ([email protected])
Wolfgang is the editor and co-author of Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, the author of Beginning Direct3D Game Programming, and a co-author of OS/2 in Team, for which he contributed the introductory chapters on OpenGL and DIVE. Wolfgang has written several articles in German journals on game programming and many online tutorials that were published on www.gamedev.net and his own web site, www.direct3d.net. During his career in the game industry he built up two game development units with four and five people that published six online games for the biggest European TV show, Wetten das..?. As a member of the board or as a CEO of different companies, he was responsible for several game projects.

Hun Yen Kwoon ([email protected])
Hun Yen Kwoon is an electrical engineering graduate from the National University of Singapore. After spending 16 years in the education system, he decided he wanted to be a programmer more than an electrical engineer. He promptly joined an IT business solutions company and developed an online debit system for a local bank before realizing that Java is boring. He is now working as a software engineer with Silicon Illusions in Singapore. His work involves 3D visualization software engineering, SSE/SSE2, OpenGL, and Direct3D. Recently he has also been fiddling with game networking architecture and dead-reckoning techniques. What kind of work can be more exciting?

Jason L. Mitchell ([email protected])
Jason is the team lead of the 3D Application Research Group at ATI Research, makers of the Radeon family of graphics processors. Working on the Microsoft campus in Redmond, Jason has worked with Microsoft for several years to define key new Direct3D features. Prior to working at ATI, Jason did work in human eye tracking for human interface applications at the University of Cincinnati, where he received his master's degree in electrical engineering in 1996. He received a bachelor's degree in computer engineering from Case Western Reserve University in 1994. In addition to this book's article on HLSL programming and an article on advanced image processing for ShaderX2: Shader Programming Tips & Tricks with DirectX 9, Jason has written for the Game Programming Gems books, Game Developer magazine, Gamasutra.com, and academic publications on graphics and image processing. He regularly presents at graphics and game development conferences around the world. His home page can be found at http://www.pixelmaven.com/jason/.

Markus Nuebel ([email protected])
Markus holds a master's degree in computer science and has been programming professionally for over eight years. Several years ago he discovered his passion for graphics and game programming. He has been into shader programming since nVidia launched Cg and spends every free minute expanding his knowledge of interesting graphic programming algorithms.

Craig Peeper ([email protected])
Craig Peeper is the lead developer for D3DX at Microsoft and has been on the team since DirectX 7. D3DX provides user-mode functionality for Direct3D, including mesh optimization, texture processing, and the High Level Shading Language compiler/runtime. Prior to his work on D3DX, Craig worked in Microsoft Graphics Research.

Natasha Tatarchuk ([email protected])
Natasha Tatarchuk is a software engineer working in the 3D Application Research Group at ATI Research, where she is the programming lead for the RenderMonkey IDE project. She has been in the graphics industry for over six years, working on 3D modeling applications and scientific visualization prior to joining ATI. Natasha graduated from Boston University with a bachelor's degree in computer science, a bachelor's degree in mathematics, and a minor in visual arts.

Nicolas Thibieroz ([email protected])
Like many kids of his generation, Nicolas Thibieroz discovered video games on the Atari VCS 2600. He quickly became fascinated by the mechanics behind those games, and started programming on the C64 and Amstrad CPC before moving on to the PC world. Nicolas realized the potential of real-time 3D graphics while playing Ultima Underworld. This game inspired him in such a way that both his school placement and final year projects were based on 3D computer graphics. After obtaining a bachelor's degree in electronic engineering in 1996 he joined PowerVR Technologies where he is now responsible for developer relations. His duties include supporting game developers, writing test programs and demos, and generally keeping up to date with the latest 3D technology.

Michal Valient ([email protected])
Michal received a degree in computer graphics at the Faculty of Mathematics, Physics and Informatics, Comenius University, Slovakia, in June 2003 after finishing his master's thesis about special effects for computer games. He is continuing with Ph.D. studies at the university. Previously he worked as director of development for a bigger company, but the call of real-time rendering was too strong and now he is fully concentrated in this area. Michal currently works for Caligari Corporation. His home page is at http://www.dimension3.host.sk.
Introduction
This book is a collection of articles that explain the foundations of shader programming, from the High Level Shading Language and version 3.0 shader models to shadow mapping and stencil shadow volumes. The following provides a brief overview of these articles:

Jason L. Mitchell and Craig Peeper, one of the creators of HLSL and the compiler, have written the best introduction to HLSL there is in "Introduction to the DirectX High Level Shading Language." Because it comes from the official source, this article covers everything that an HLSL programmer needs and a lot more.

The vs_3_0 and ps_3_0 shader models will be available in third-generation shader graphics hardware. These shader versions are much more flexible and powerful than the previous versions, offering vertex texturing capabilities, predication, static and dynamic flow control, vertex stream frequency, and much more. Nicolas Thibieroz, Kristof Beets, and Aaron Burton from PowerVR have written an introduction to this shader model that explains every new feature and includes a source snippet.

Michal Valient's article "Advanced Lighting and Shading with Direct3D 9" covers some more advanced lighting models including Phong, Oren-Nayar, and Cook-Torrance. He implements these algorithms with ps_1_4, ps_2_0, ps_3_0, and HLSL. This is the most extensive treatment of this topic available.

There are several different ways to use fog to produce a specific mood in games. Markus Nuebel shows all possible ways to implement fog in a way that is easy to understand. The six example programs make using fog as easy as possible.

Michal Valient's second contribution is the article "Shadow Mapping with Direct3D 9." With the release of DirectX 9 and its floating-point textures, using shadow maps for shadows leads to a much better visual experience. Michal shows how to implement shadow mapping in the most efficient and most flexible way and gives tips on how to debug an application.

The most comprehensive treatment of shadow volumes available is contained in the article "The Theory of Stencil Shadow Volumes" by Hun Yen Kwoon. It covers every aspect of the various ways of programming shadow volumes. Six example programs give you a head start on implementing shadow volumes in minutes.

ATI's RenderMonkey is a shader development tool that helps to reduce the workload of programmers and artists. One of its creators, Natalya Tatarchuk, explains how to use it and discusses its feature set.

A topic that is seldom covered elsewhere is the necessity of creating geometric data in the art pipeline that is shader-friendly. Gim Guan Chua has written an article describing this task and provides a step-by-step explanation of how to do it.
Introduction to the DirectX High Level Shading Language
Craig Peeper and Jason L. Mitchell

Introduction
One of the most empowering new components of DirectX 9 is the High Level Shading Language (HLSL). Using this standard high-level language, shader writers can think at the algorithm level while implementing shaders rather than worry about meddlesome hardware details, such as register allocation, register read-port limits, instruction co-issuing, and so on. In addition to freeing the developer from hardware details, the HLSL also has all of the usual advantages of a high-level language, such as easy code reuse, improved readability, and the presence of an optimizing compiler.

Many of the chapters in this book and in ShaderX2: Shader Programming Tips & Tricks with DirectX 9 (also from Wordware Publishing) utilize shaders that are written in HLSL. As a result, it will be much easier for you to understand and work with those shaders after reading this introductory chapter. In this chapter, we outline the basic structure of the language itself, as well as strategies for integrating HLSL shaders into your application.
A Simple Example
Before presenting an exhaustive description of the HLSL, let's first have a look at one HLSL vertex shader and one HLSL pixel shader taken from an application that renders simple procedural wood. The first HLSL shader shown below is a simple vertex shader:
    float4x4 view_proj_matrix;
    float4x4 texture_matrix0;

    struct VS_OUTPUT
    {
        float4 Pos    : POSITION;
        float3 Pshade : TEXCOORD0;
    };

    VS_OUTPUT main (float4 vPosition : POSITION)
    {
        VS_OUTPUT Out = (VS_OUTPUT) 0;

        // Transform position to clip space
        Out.Pos = mul (view_proj_matrix, vPosition);

        // Transform Pshade
        Out.Pshade = mul (texture_matrix0, vPosition);

        return Out;
    }
The first two lines of this shader declare a pair of 4×4 matrices called view_proj_matrix and texture_matrix0. Following these global-scope matrices, a structure is declared. This VS_OUTPUT structure has two members: a float4 called Pos and a float3 called Pshade. The main function for this shader takes a single float4 input parameter and returns a VS_OUTPUT structure. The float4 input vPosition is the sole input to the shader, while the returned VS_OUTPUT struct defines this vertex shader's output. For now, don't worry about the POSITION and TEXCOORD0 keywords following these parameters and structure members. These are called semantics, and their meaning is discussed later in this chapter.

Looking at the actual code body of the main function, you can see that an intrinsic function called mul is used to multiply the input vPosition vector by the view_proj_matrix matrix. This intrinsic is commonly used in vertex shaders to perform vector-matrix multiplication. In this case, vPosition is treated as a column vector, since it is the second parameter to mul. If the vPosition vector were the first parameter to mul, it would be treated as a row vector. (The mul intrinsic and other intrinsics are discussed in more detail later in the chapter.) Following the transformation of the input position vPosition to clip space, vPosition is multiplied by another matrix called texture_matrix0 to generate a 3D texture coordinate. The results of both of these transformations have been written to members of a VS_OUTPUT structure, which is returned. A vertex shader must always output a clip-space position at a minimum. Any additional values that are output from the vertex shader are interpolated across the rasterized polygon and available as inputs to the pixel shader. In this case, the 3D Pshade is passed from the vertex to the pixel shader via an interpolator.

Below, we see a simple HLSL procedural wood pixel shader. This pixel shader, which is written to work with the vertex shader that we just described, will be compiled for the ps_2_0 target.
    float4 lightWood; // xyz == Light Wood Color
    float4 darkWood;  // xyz == Dark Wood Color
    float  ringFreq;  // ring frequency

    sampler PulseTrainSampler;

    float4 hlsl_rings (float4 Pshade : TEXCOORD0) : COLOR
    {
        float scaledDistFromZAxis = sqrt(dot(Pshade.xy, Pshade.xy)) * ringFreq;
        float blendFactor = tex1D (PulseTrainSampler, scaledDistFromZAxis);
        return lerp (darkWood, lightWood, blendFactor);
    }
The first few lines of this shader are the declaration of a pair of floating-point 4-tuples and one scalar float at global scope. Following these variables, a sampler called PulseTrainSampler is declared. Samplers are discussed in more detail later in the chapter, but for now you can just think of a sampler as a window into video memory with an associated state defining things like filtering and texture coordinate addressing modes. With variable and sampler declarations out of the way, we can move on to the body of the shader code. You can see that there is one input parameter called Pshade, which is interpolated across the polygon. This is the value that was computed at each vertex by the vertex shader above. In the pixel shader, the Cartesian distance from the shader-space z-axis is computed, scaled, and used as a 1D texture coordinate to access the texture bound to the PulseTrainSampler. The scalar color that is returned from the tex1D() sampling function is used as a blend factor to blend between the two constant colors (lightWood and darkWood) declared at the global scope of the shader. The 4D vector result of this blend is the final output of the pixel shader. All pixel shaders must return a 4D RGBA color at a minimum. We discuss additional optional pixel shader outputs later in the chapter.
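The arithmetic in the two shaders above can be mirrored on the CPU as a rough sanity check. The following C++ sketch is not part of the original sample and rests on several assumptions: the Float4 and Float4x4 types, the helper names (mulColumn, mulRow, pulseTrain, lerp4, hlslRings), and the square-wave stand-in for the pulse-train texture are all hypothetical. It illustrates how mul treats a vector as a column or as a row depending on argument order, and how the pixel shader turns the distance from the z-axis into a blend between the two wood colors.

```cpp
#include <array>
#include <cmath>

using Float4 = std::array<float, 4>;
using Float4x4 = std::array<Float4, 4>;  // indexed as m[row][col]

// Mirrors mul(M, v): the vector is the second argument, so it acts as a
// column vector and each result component is a row of M dotted with v.
Float4 mulColumn(const Float4x4& m, const Float4& v) {
    Float4 r{};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            r[row] += m[row][col] * v[col];
    return r;
}

// Mirrors mul(v, M): the vector is the first argument, so it acts as a
// row vector and each result component is v dotted with a column of M.
Float4 mulRow(const Float4& v, const Float4x4& m) {
    Float4 r{};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            r[col] += v[row] * m[row][col];
    return r;
}

// Hypothetical stand-in for the texture behind PulseTrainSampler: a square
// wave that repeats every unit of the coordinate (wrap addressing assumed).
float pulseTrain(float u) {
    const float frac = u - std::floor(u);
    return frac < 0.5f ? 1.0f : 0.0f;
}

// Componentwise linear interpolation, in the spirit of the HLSL lerp intrinsic.
Float4 lerp4(const Float4& a, const Float4& b, float t) {
    Float4 r{};
    for (int i = 0; i < 4; ++i)
        r[i] = a[i] + t * (b[i] - a[i]);
    return r;
}

// CPU mirror of hlsl_rings: distance from the z-axis, scaled by the ring
// frequency, looked up in the pulse train, then used as a blend factor.
Float4 hlslRings(const Float4& pshade, const Float4& lightWood,
                 const Float4& darkWood, float ringFreq) {
    const float scaledDist =
        std::sqrt(pshade[0] * pshade[0] + pshade[1] * pshade[1]) * ringFreq;
    const float blendFactor = pulseTrain(scaledDist);
    return lerp4(darkWood, lightWood, blendFactor);
}
```

With an identity matrix whose m[0][3] entry is set to 10, mulColumn moves that value into the x component of the result, while mulRow moves it into the w component. This is why the argument order passed to mul has to agree with the way the application lays out its matrices.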
human-readable assembly language code to the D3DX library via D3DXAssembleShader() and gets back a binary representation of the shader, which would in turn be passed to Direct3D via CreatePixelShader() or CreateVertexShader(). For more on the details of the legacy assembly shader models, please refer to the many resources available online and offline, including Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks and the DirectX SDK.
Figure 1: Use of D3DX for assembly and compilation in DirectX 8 and DirectX 9
As shown on the right side of Figure 1, the situation in DirectX 9 is very similar in that the application passes an HLSL shader to D3DX via the D3DXCompileShader() API and gets back a binary representation of the compiled shader, which is in turn passed to Direct3D via CreatePixelShader() or CreateVertexShader(). The binary asm code that is generated is only a function of the compile target chosen, not the specific graphics device in the user's or developer's system. That is, the binary asm that is generated is vendor-neutral and will be the same no matter where you compile or run it. In fact, the Direct3D runtime itself does not know anything about HLSL, only about the binary assembly shader models. This is nice because it means that the HLSL compiler can be updated independently of the Direct3D runtime. In fact, between press time and the release of the first printing of this book in late summer 2003, Microsoft plans to release a DirectX SDK update, which will contain an updated HLSL compiler.
In addition to the development of the HLSL compiler in D3DX, DirectX 9 also introduced additional assembly-level shader models to expose the functionality of the latest generation of 3D graphics hardware. Application developers can feel free to work directly in the assembly languages for these new models (vs_2_0, vs_3_0, ps_2_0, and ps_3_0), but we expect most developers to move wholesale to HLSL for shader development.
Hardware Realities
Of course, just because you can write an HLSL program to express a particular shading algorithm doesn't mean that it will run on a given piece of hardware. As we discussed earlier, an application calls D3DX to compile an HLSL shader to binary asm via the D3DXCompileShader() API. One of the parameters to this API entry point defines which of the assembly language models (or compile targets) the HLSL compiler should use to express the final shader code. If an application is doing HLSL shader compilation at run time (as opposed to offline), the application could examine the capabilities of the Direct3D device and select the compile target to match. If the algorithm expressed in the HLSL shader is too complex to execute on the selected compile target, compilation will fail. This means that while HLSL is a huge benefit to shader development, it does not free developers from the realities of shipping games to a target audience that owns graphics devices of varying capabilities. As a game developer, you still have to manage a tiered approach to your visuals, writing better shaders for better graphics cards and more basic versions for older cards. With well-written HLSL, however, this burden can be eased significantly.
Compilation Failure
As mentioned above, failure of a given HLSL shader to compile for a particular compile target is an indication that the shader is too complex for the compile target. This can mean that the shader either requires too many resources or requires some capability, such as dynamic branching, that is not supported by the chosen compile target. For example, an HLSL shader could be written to access a given texture map six times in a shader. If this shader is compiled for the ps_1_1 compile target, compilation will fail since the ps_1_1 model supports only four textures. Another common source of compilation failure is exceeding the instruction count of the chosen compile target. An algorithm expressed in HLSL may simply require too many instructions to be executed by a given compile target.

It is important to note that the choice of compile target does not restrict the HLSL syntax that a shader writer can use. For example, a shader writer can use for loops, subroutines, if-else statements, etc., and still compile for targets that don't natively support looping, branching, or if-else statements. In such cases, the compiler will unroll loops, inline function calls, and execute both branches of an if-else statement, selecting the proper result based upon the original value used in the if-else statement. Of course, if the resulting shader is too long or otherwise exceeds the resources of the compile target, compilation will fail.
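As a rough illustration of this point, the sketch below uses a for loop and an if-else statement in a pixel shader intended for the ps_2_0 target, which has no dynamic flow control. The shader name, the sampler BaseSampler, and the uniform fThreshold are hypothetical and are not part of the chapter's example shaders. The expectation is that the compiler unrolls the fixed four-iteration loop into four texture reads and evaluates both sides of the if-else before selecting one result.

```hlsl
sampler BaseSampler;   // hypothetical sampler, for illustration only
float   fThreshold;    // hypothetical uniform, for illustration only

float4 flow_example (float2 uv : TEXCOORD0) : COLOR
{
    float4 sum = 0.0f;

    // Fixed trip count: on targets without loops this can be unrolled
    for (int i = 0; i < 4; i++)
    {
        sum += tex2D (BaseSampler, uv + float2 (0.01f * i, 0.0f));
    }
    sum *= 0.25f;

    // On targets without branching, both sides are computed and one is selected
    float4 result;
    if (sum.r > fThreshold)
        result = sum;
    else
        result = sum * 0.5f;

    return result;
}
```

Raising the loop count far enough would eventually exceed the texture or arithmetic instruction limits of the target, at which point compilation would fail as described above.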
machine's capabilities at a more detailed level. These command-line options are summarized in the following table.
Command-line Option    Description
-T target              compile target (default: vs_2_0)
-E name                entry point name (default: main)
-Od                    disable optimizations
-Vd                    disable validation
-Zi                    enable debugging information
-Zpr                   pack matrices in row-major order
-Zpc                   pack matrices in column-major order
-Fo file               output object file
-Fc file               output listing of generated code
-Fh file               output header containing generated code
-D id = text           define macro
-nologo                suppress copyright message
Now that you understand the context in which the HLSL compiler can be used for shader development, let's discuss the actual mechanics of the language. As we progress, it is important to keep the notion of a compile target and the varying capabilities of the underlying assembly shader models in mind.
Language Basics
Now that you have a sense of what HLSL vertex and pixel shaders look like and how they interact with the low-level assembly shaders, we can discuss some of the details of the language itself.
Keywords
Keywords are predefined identifiers that are reserved for the HLSL language and cannot be used as identifiers in your program. Keywords marked with an asterisk (*) are case insensitive.
asm* decl* bool do compile double const else
The following keywords are currently unused but reserved for potential future use:
auto            break           case              catch
char            class           compile           const
const_cast      continue        default           delete
dynamic_cast    enum            explicit          friend
goto            long            mutable           namespace
new             operator        private           protected
public          register        reinterpret_cast  short
signed          sizeof          static_cast       switch
template        this            throw             try
typename        union           unsigned          using
virtual
Data Types
The HLSL has support for a variety of data types, from simple scalars to more complex types, such as vectors and matrices.
Scalar Types
The language supports the following scalar data types:
Data Type    Representable Values
bool         true or false
int          32-bit signed integer
half         16-bit floating-point value
float        32-bit floating-point value
double       64-bit floating-point value
If you are already familiar with the assembly-level programming models, you should know that graphics processors do not currently have native support for all of these data types. As a result, integers may need to be emulated using floating-point hardware. This means that integer operations that go outside the range of integers that can be expressed as floats on these platforms are not guaranteed to function as expected. Additionally, not all target platforms have native support for half or double values. If the target platform does not, these will be emulated using float.
Vector Types
You will often find yourself declaring vector variables in your HLSL shaders. There are a variety of ways that these vectors can be declared, including the following:
Vector                  Declared as
vector                  A vector of dimension 4; each component is of type float.
vector<type, size>      A vector of dimension size; each component is of scalar type type.
The most common way that you see shader authors declare vectors, however, is by using the name of a type followed by an integer from 2 to 4. To declare a 4-tuple of floats, for example, you could use any of the following vector declarations:
float4             fVector0;
float              fVector1[4];
vector             fVector2;
vector <float, 4>  fVector3;
To declare a 3-tuple of bools, for example, you could use any of the following declarations:
bool3             bVector0;
bool              bVector1[3];
vector <bool, 3>  bVector2;
Once you have defined a vector, you may access its individual components by using the array access syntax or a swizzle. In the swizzle case, the components must come from either the {x, y, z, w} or {r, g, b, a} namespace (but not both). For example:
float4 pos    = {3.0f, 5.0f, 2.0f, 1.0f};
float  value0 = pos[0]; // value0 is 3.0f
float  value1 = pos.x;  // value1 is 3.0f
float  value2 = pos.g;  // value2 is 5.0f
float2 vec0   = pos.xy; // vec0 is {3.0f, 5.0f}
float2 vec1   = pos.ry; // INVALID because of bad swizzle
It should be noted that the ps_2_0 and lower pixel shader models do not have native support for arbitrary swizzles. Hence, concise high-level code that uses swizzles can result in fairly nasty binary asm when compiling to these targets. You should familiarize yourself with the native swizzles available in these assembly models.
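To make the swizzle rules more concrete, here is a small sketch of a pixel shader that reorders components, replicates a single component, and writes to a subset of components. The function name and variable names are hypothetical. Each individual swizzle stays within one namespace, either {x, y, z, w} or {r, g, b, a}.

```hlsl
float4 swizzle_example (float4 cIn : COLOR0) : COLOR
{
    float3 vBGR = cIn.bgr;    // reorder: blue, green, red
    float4 vRed = cIn.rrrr;   // replicate the red component four times
    float2 vWZ  = cIn.wz;     // pick two components in reverse order

    float4 vOut = 0.0f;
    vOut.xyz = vBGR;               // write only the first three components
    vOut.w   = vRed.x * vWZ.x;     // write only the last component

    return vOut;
}
```

On targets with limited native swizzle support, a reordering such as .bgr may cost extra instructions in the generated asm, so it is worth inspecting the compiler's listing output when instruction count is tight.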
Matrix Types
Another very common type of variable that you will find yourself using in HLSL shaders is the matrix, which is a 2D array of data. Like scalars and vectors, matrices may be composed of any of the basic data types: bool, int, half, float, or double. Matrices may be of any size, but you will typically find shader writers using matrices with up to four rows and columns. Recall that the example vertex shader shown at the beginning of the chapter declared two 4×4 float matrices at global scope:
float4x4 view_proj_matrix;
float4x4 texture_matrix0;
Naturally, other dimensions of matrices can be used. For example, we could declare a floating-point matrix with three rows and four columns in a variety of ways:
float3x4             mat0;
matrix<float, 3, 4>  mat1;
Like vectors, the individual elements of matrices can be accessed using array or structure/swizzle syntax. For example, the following array indexing syntax can be used to access the top-left element of the matrix view_proj_matrix:
float fValue = view_proj_matrix[0][0];
There is also a structure syntax defined for access to and swizzling of matrix elements. For zero-based row-column position, you can use any of the following:

_m00, _m01, _m02, _m03
_m10, _m11, _m12, _m13
_m20, _m21, _m22, _m23
_m30, _m31, _m32, _m33

For one-based row-column position, you can use any of the following:

_11, _12, _13, _14
_21, _22, _23, _24
_31, _32, _33, _34
_41, _42, _43, _44

Matrices can also be accessed using array notation. For example:
float2x2 fMat = {3.0f, 5.0f,  // row 1
                 2.0f, 1.0f}; // row 2

float  value0 = fMat[0];      // value0 is 3.0f (row 0 truncated to its first component)
float  value1 = fMat._m00;    // value1 is 3.0f
float  value2 = fMat._12;     // value2 is 5.0f
float  value3 = fMat[1][1];   // value3 is 1.0f
float2 vec0   = fMat._21_22;  // vec0 is {2.0f, 1.0f}
float2 vec1   = fMat[1];      // vec1 is {2.0f, 1.0f}
Type Modifiers
There are a couple of optional type modifiers in the HLSL that you may want to use in your shaders. The familiar const type modifier is used to specify a variable whose value cannot be changed by the shader code. Using such a variable on the left side of an assignment (i.e., as an lval) will result in a compilation error.
The row_major and col_major type modifiers can be used to specify the expected layout of a matrix within the hardware constant store. The row_major type modifier indicates that each row of the matrix will be stored in a single constant register. Likewise, using col_major indicates that each column of the matrix will be stored in a single constant register. Column major is the default.
extern float translucencyCoeff;
const  float gloss_bias;
static float gloss_scale;
float        diffuse;
The variables diffuse and translucencyCoeff are settable by the Set*ShaderConstant*() API and can be modified by the shader itself. The const variable gloss_bias is settable by the Set*ShaderConstant*() API but cannot be modified in the shader code. Finally, the static variable gloss_scale is not settable by the Set*ShaderConstant*() API but can be modified within the shader only.
Initializers
As we have shown in some of the preceding examples, it is possible to initialize variables at declaration time in the same manner used in C. For example:
float2x2 fMat = {3.0f, 5.0f,  // row 1
                 2.0f, 1.0f}; // row 2
float4 vPos   = {3.0f, 5.0f, 2.0f, 1.0f};
float fFactor = 0.2f;
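Arithmetic operators applied to vector variables work per component. As a minimal sketch, the statement inside the function below multiplies two 4D vectors; it is reconstructed here from the per-component expansion that follows, so the declarations and the wrapping function tone_example are assumptions added only to make the fragment self-contained.

```hlsl
float4 vBrightness;   // assumed to be set by the application
float4 vExposure;     // assumed to be set by the application

float4 tone_example () : COLOR
{
    // Per-component multiply of two 4D vectors
    float4 vTone = vBrightness * vExposure;
    return vTone;
}
```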
Assuming vBrightness and vExposure are both of type float4, this is equivalent to:
float4 vTone;
vTone.x = vBrightness.x * vExposure.x;
vTone.y = vBrightness.y * vExposure.y;
vTone.z = vBrightness.z * vExposure.z;
vTone.w = vBrightness.w * vExposure.w;
Note that this is not a dot product between the 4D vectors vBrightness and vExposure. Additionally, multiplying matrix variables in this way does not result in a matrix multiply. Dot products and matrix multiplies are applied via the intrinsic function mul(), which we discuss later in the chapter.
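The distinction can be sketched in a short vertex shader. The function name and the uniforms matA and matB are hypothetical; the intrinsics dot() and mul() are used as described in this chapter, with a vector in the second argument of mul() treated as a column vector and a vector in the first argument treated as a row vector.

```hlsl
float4x4 matA;   // hypothetical uniform
float4x4 matB;   // hypothetical uniform

float4 math_example (float4 vA : POSITION, float4 vB : TEXCOORD0) : POSITION
{
    float4   vPerComp   = vA * vB;            // four independent products
    float    fDot       = dot (vA, vB);       // a single scalar dot product

    float4x4 matPerComp = matA * matB;        // element-by-element products
    float4x4 matProd    = mul (matA, matB);   // true matrix multiply

    float4   vCol       = mul (matProd, vA);  // vA treated as a column vector
    float4   vRow       = mul (vA, matProd);  // vA treated as a row vector

    // Combine everything so that no result is optimized away
    return vPerComp * fDot + vCol + vRow + matPerComp[0];
}
```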
Constructors
Another language feature that you often see in HLSL shaders is the constructor, which is similar to C++ but has some enhancements to deal with complex data types. Example uses of constructors include:
float3 vPos     = float3(4.0f, 1.0f, 2.0f);
float  fDiffuse = dot(vNormal, float3(1.0f, 0.0f, 0.0f));
float4 vPack    = float4(vPos, fDiffuse);
Constructors are commonly used when a shader writer wants to temporarily define a quantity with literal values (as in dot(vNormal, float3(1.0f, 0.0f, 0.0f)) above) or when a shader writer wants to explicitly pack smaller data types together (as in float4(vPos, fDiffuse) above). In this case, the float4 constructor takes in a float3 and a float and returns a float4 with the data packed together.
Type Casting
To aid in shader writing and the efficiency of the generated code, it is a good idea to be familiar with HLSL's type casting behavior. Type casting often happens in order to promote or demote a given variable to match a variable to which it is being assigned. For example, in the following case, a literal float 0.0f is being cast to a float4 {0.0f, 0.0f, 0.0f, 0.0f} to initialize vResult.
float4 vResult = 0.0f;
Similar casting can occur when assigning a higher dimensional data type like a vector or matrix to a lower dimensional data type. In these cases, the extra data is effectively omitted. For example, we may write the following code:
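A minimal sketch of such an assignment is shown below. It assumes that vLight is a float3 and that fColor and fFinal are scalar floats; the global declarations and the wrapping function cast_example are assumptions added to make the fragment self-contained. The compiler may warn about the implicit truncation.

```hlsl
float3 vLight;   // assumed 3D vector uniform
float  fColor;   // assumed scalar uniform

float4 cast_example () : COLOR
{
    // Assigning a float3 expression to a float keeps only the first component
    float fFinal = vLight * fColor;
    return fFinal;   // scalar replicated to all four output components
}
```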
In this case, vLight is cast to a float by using only the first component in the multiply with the scalar float fColor, so fFinal is equal to vLight.x * fColor. It is a good idea to be familiar with the following table of type casting rules for HLSL:
Type of Cast              Casting Behavior
Scalar-to-scalar          Always valid. When casting from bool type to an integer or floating-point type, false is considered to be zero and true is considered to be one. When casting from an integer or floating-point type to bool, a zero value is considered to be false and a nonzero value is considered to be true. When casting from a floating-point type to an integer type, the value is rounded toward zero. This is the same truncation behavior as in C.
Scalar-to-vector          Always valid. This cast operates by replicating the scalar to fill the vector.
Scalar-to-matrix          Always valid. This cast operates by replicating the scalar to fill the matrix.
Scalar-to-structure       This cast operates by replicating the scalar to fill the structure.
Vector-to-scalar          Always valid. This selects the first component of the vector.
Vector-to-vector          The destination vector must not be larger than the source vector. The cast operates by keeping the leftmost values and truncating the rest. For the purposes of this cast, column matrices, row matrices, and numeric structures are treated as vectors.
Vector-to-matrix          The size of the vector must be equal to the size of the matrix.
Vector-to-structure       This is valid if the structure is not larger than the vector, and all components of the structure are numeric.
Matrix-to-scalar          Always valid. This selects the upper-left component of the matrix.
Matrix-to-vector          The size of the matrix must be equal to the size of the vector.
Matrix-to-matrix          The destination matrix must not be larger than the source matrix in both dimensions. The cast operates by keeping the upper-left values and truncating the rest.
Matrix-to-structure       The size of the structure must be equal to the size of the matrix, and all components of the structure are numeric.
Structure-to-scalar       The structure must contain at least one member.
Structure-to-vector       The structure must be at least the size of the vector. The first components must be numeric, up to the size of the vector.
Structure-to-matrix       The structure must be at least the size of the matrix. The first components must be numeric, up to the size of the matrix.
Structure-to-object       The structure must contain at least one member. The type of this member must be identical to the type of the object.
Structure-to-structure    The destination structure must not be larger than the source structure. A valid cast must exist between all respective source and destination components.
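A few of these rules can be exercised with explicit C-style casts, as in the following sketch. The shader name and the uniform matWorld are hypothetical, and the values in the comments follow directly from the rules in the table above.

```hlsl
float4x4 matWorld;   // hypothetical uniform

float4 cast_rules_example (float3 vNormal : NORMAL) : POSITION
{
    float3x3 matRot = (float3x3) matWorld;  // matrix-to-matrix: keeps the upper-left 3x3
    float4   vAll   = (float4) 0.5f;        // scalar-to-vector: replicates 0.5f
    float2   vXY    = (float2) vAll;        // vector-to-vector: keeps the leftmost two
    int      nWhole = (int) 2.7f;           // float-to-int: rounds toward zero, giving 2
    float    fFlag  = (float) true;         // bool-to-float: true becomes 1.0f

    float3 vRotated = mul (matRot, vNormal);
    return float4 (vRotated, fFlag) + float4 (vXY, nWhole, 0.0f);
}
```

Writing the cast explicitly, rather than relying on implicit conversion, tends to make the intended truncation obvious to the next reader of the shader.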
Structures
As we showed in the first example shader, it is often convenient to be able to define structures in HLSL shaders. For example, many shader writers will define an output structure in their vertex shader code and use this structure as the return type from their vertex shaders main function. (It is less common to do this with a pixel shader since most pixel shaders have only one float4 output.) An example structure taken from the NPR Metallic shader that we discuss later is shown below:
struct VS_OUTPUT
{
    float4 Pos   :
    float3 View  :
    float3 Normal:
    float3 Light1:
    float3 Light2:
    float3 Light3:
};
Structures may be declared for general use in an HLSL shader as well. They follow the type casting rules outlined above.
Samplers
For each different texture map that you plan to sample in a pixel shader, you must declare a sampler. Recall the hlsl_rings() shader described earlier:
float4 lightWood; // xyz == Light Wood Color
float4 darkWood;  // xyz == Dark Wood Color
float  ringFreq;  // ring frequency

sampler PulseTrainSampler;

float4 hlsl_rings (float4 Pshade : TEXCOORD0) : COLOR
{
    float scaledDistFromZAxis = sqrt(dot(Pshade.xy, Pshade.xy)) * ringFreq;
    float blendFactor = tex1D (PulseTrainSampler, scaledDistFromZAxis);
    return lerp (darkWood, lightWood, blendFactor);
}
In this shader, we declared a sampler called PulseTrainSampler at global scope and passed it as the first parameter to the tex1D() intrinsic function (we discuss intrinsics in the next section). An HLSL sampler has a very direct mapping to the API concept of a sampler and, in turn, to the actual silicon in the 3D graphics processor, which is responsible for addressing and filtering textures. A sampler must be defined for every texture map that you plan to access in a given shader, but you may use a given sampler multiple times in a shader. This usage is very common in image processing applications, as discussed in ShaderX2: Shader Programming Tips & Tricks with DirectX 9, since the input image is often sampled multiple times with different texture coordinates to provide data to a filter kernel expressed in shader code. For example, the following shader uses the rasterizer to convert a height map to a normal map with a pair of Sobel filters:
sampler InputImage;

float4 main( float2 topLeft    : TEXCOORD0, float2 left        : TEXCOORD1,
             float2 bottomLeft : TEXCOORD2, float2 top         : TEXCOORD3,
             float2 bottom     : TEXCOORD4, float2 topRight    : TEXCOORD5,
             float2 right      : TEXCOORD6, float2 bottomRight : TEXCOORD7): COLOR
{
    // Take all eight taps
    float4 tl = tex2D (InputImage, topLeft);
    float4 l  = tex2D (InputImage, left);
    float4 bl = tex2D (InputImage, bottomLeft);
    float4 t  = tex2D (InputImage, top);
    float4 b  = tex2D (InputImage, bottom);
    float4 tr = tex2D (InputImage, topRight);
    float4 r  = tex2D (InputImage, right);
    float4 br = tex2D (InputImage, bottomRight);

    // Compute dx using Sobel operator:
    //
    //   -1 0 1
    //   -2 0 2
    //   -1 0 1
    float dX = -tl.a - 2.0f*l.a - bl.a + tr.a + 2.0f*r.a + br.a;

    // Compute dy using Sobel operator:
    //
    //   -1 -2 -1
    //    0  0  0
    //    1  2  1
    float dY = -tl.a - 2.0f*t.a - tr.a + bl.a + 2.0f*b.a + br.a;

    // Compute cross product and renormalize
    float4 N = float4(normalize(float3(-dX, -dY, 1)), tl.a);

    // Convert signed values from -1..1 to 0..1 range and return
    return N * 0.5f + 0.5f;
}
This shader uses only one sampler, InputImage, but samples from it eight times using the tex2D() intrinsic function.
Intrinsics
As mentioned in the preceding section, there are a number of intrinsics built into the DirectX High Level Shading Language for your convenience. Many intrinsics, such as mathematical functions, are provided for convenience, while others, such as the tex1D() and tex2D() functions mentioned above, are necessary for accessing texture data via samplers.
Math Intrinsics
The math intrinsics listed in the table below will be converted to micro operations by the HLSL compiler. In some cases, such as abs() and dot(), these intrinsics will map directly to single assembly-level operations, while in other cases, such as refract() and step(), they will map to multiple assembly instructions. There are even a couple of cases, notably ddx(), ddy(), and fwidth(), that are not supported for all compile targets. The math intrinsics are shown below:
Intrinsic                 Description
abs(x)                    Absolute value (per component).
acos(x)                   Returns the arccosine of each component of x. Each component should be in the range [-1, 1].
all(x)                    Tests if all components of x are nonzero.
any(x)                    Tests if any component of x is nonzero.
asin(x)                   Returns the arcsine of each component of x. Each component should be in the range [-1, 1]; the return values are in the range [-π/2, π/2].
atan(x)                   Returns the arctangent of x. The return values are in the range [-π/2, π/2].
atan2(y, x)               Returns the arctangent of y/x. The signs of y and x are used to determine the quadrant of the return values in the range [-π, π]. atan2 is well-defined for every point other than the origin, even if x equals 0 and y does not equal 0.
ceil(x)                   Returns the smallest integer that is greater than or equal to x.
clamp(x, min, max)        Clamps x to the range [min, max].
clip(x)                   Discards the current pixel, if any component of x is less than 0. This can be used to simulate clip planes, if each component of x represents the distance from a plane. This is the intrinsic that you use when you want to generate an asm texkill.
cos(x)                    Returns the cosine of x.
cosh(x)                   Returns the hyperbolic cosine of x.
cross(a, b)               Returns the cross product of two 3D vectors a and b.
D3DCOLORtoUBYTE4(x)       Swizzles and scales components of the 4D vector x to compensate for the lack of UBYTE4 stream component support in some hardware.
ddx(x)                    Returns the partial derivative of x with respect to the screen-space x-coordinate.
ddy(x)                    Returns the partial derivative of x with respect to the screen-space y-coordinate.
degrees(x)                Converts x from radians to degrees.
determinant(m)            Returns the determinant of the square matrix m.
distance(a, b)            Returns the distance between two points a and b.
dot(a, b)                 Returns the dot product of two vectors a and b.
exp(x)                    Returns the base-e exponent e^x.
exp2(a)                   Base-2 exponent (per component).
faceforward(n, i, ng)     Returns -n * sign(dot(i, ng)).
floor(x)                  Returns the greatest integer that is less than or equal to x.
fmod(a, b)                Returns the floating-point remainder f of a / b such that a = i * b + f, where i is an integer, f has the same sign as a, and the absolute value of f is less than the absolute value of b.
frac(x)                   Returns the fractional part f of x, such that f is a value greater than or equal to 0 and less than 1.
frexp(x, out exp)         Returns the mantissa and exponent of x. frexp returns the mantissa, and the exponent is stored in the output parameter exp. If x is 0, the function returns 0 for both the mantissa and the exponent.
fwidth(x)                 Returns abs(ddx(x)) + abs(ddy(x)).
isfinite(x)               Returns true if x is finite; false otherwise.
isinf(x)                  Returns true if x is +INF or -INF; false otherwise.
isnan(x)                  Returns true if x is NAN or QNAN; false otherwise.
ldexp(x, exp)             Returns x * 2^exp.
length(v)                 Returns the length of the vector v.
lerp(a, b, s)             Returns a + s(b - a). This linearly interpolates between a and b, such that the return value is a when s is 0 and b when s is 1.
log(x)                    Returns the base-e logarithm of x. If x is negative, the function returns indefinite. If x is 0, the function returns -INF.
log10(x)                  Returns the base-10 logarithm of x. If x is negative, the function returns indefinite. If x is 0, the function returns -INF.
log2(x)                   Returns the base-2 logarithm of x. If x is negative, the function returns indefinite. If x is 0, the function returns -INF.
max(a, b)                 Selects the greater of a and b.
min(a, b)                 Selects the lesser of a and b.
modf(x, out ip)           Splits the value x into fractional and integer parts, each of which has the same sign as x. The signed fractional portion of x is returned. The integer portion is stored in the output parameter ip.
mul(a, b)                 Performs matrix multiplication between a and b. If a is a vector, it is treated as a row vector. If b is a vector, it is treated as a column vector. The inner dimensions a_columns and b_rows must be equal. The result has the dimension a_rows × b_columns.
normalize(v)              Returns the normalized vector v / length(v). If the length of v is 0, the result is indefinite.
pow(x, y)                 Returns x^y.
radians(x)                Converts x from degrees to radians.
reflect(i, n)             Returns the reflection vector v, given the entering ray direction i and the surface normal n, such that v = i - 2 * dot(i, n) * n.
refract(i, n, eta)        Returns the refraction vector v, given the entering ray direction i, the surface normal n, and the relative index of refraction eta. If the angle between i and n is too great for a given eta, refract returns (0,0,0).
round(x)                  Rounds x to the nearest integer.
rsqrt(x)                  Returns 1 / sqrt(x).
saturate(x)               Clamps x to the range [0, 1].
sign(x)                   Computes the sign of x. Returns -1 if x is less than 0, 0 if x equals 0, and 1 if x is greater than 0.
sin(x)                    Returns the sine of x.
sincos(x, out s, out c)   Returns the sine and cosine of x. sin(x) is stored in the output parameter s. cos(x) is stored in the output parameter c.
sinh(x)                   Returns the hyperbolic sine of x.
smoothstep(min, max, x)   Returns 0 if x < min. Returns 1 if x > max. Returns a smooth Hermite interpolation between 0 and 1 if x is in the range [min, max].
sqrt(x)                   Square root (per component).
step(a, x)                Returns (x >= a) ? 1 : 0.
tan(x)                    Returns the tangent of x.
tanh(x)                   Returns the hyperbolic tangent of x.
transpose(m)              Returns the transpose of the matrix m. If the source is dimension m_rows × m_columns, the result is dimension m_columns × m_rows.
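As a brief sketch of how several of these math intrinsics combine in practice, the pixel shader below computes a clamped diffuse term and uses it to blend between two colors. The shader name and the uniforms cLow and cHigh are hypothetical; the behavior noted in the comments follows the descriptions in the table above.

```hlsl
float4 cLow;    // hypothetical uniform: color used in unlit regions
float4 cHigh;   // hypothetical uniform: color used in fully lit regions

float4 intrinsic_example (float3 vNormal : TEXCOORD0,
                          float3 vLight  : TEXCOORD1) : COLOR
{
    float3 N = normalize (vNormal);
    float3 L = normalize (vLight);

    float fDiffuse = saturate (dot (N, L));            // clamped to [0, 1]
    float fSoft    = smoothstep (0.2f, 0.8f, fDiffuse); // smooth ramp between 0.2 and 0.8
    float fHard    = step (0.5f, fDiffuse);             // 1 when fDiffuse >= 0.5, else 0

    float4 cBlend = lerp (cLow, cHigh, fSoft);          // cLow at 0, cHigh at 1
    return cBlend * (0.75f + 0.25f * fHard);
}
```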
Intrinsic                 Description
tex3Dbias(s, t)           Biased 3D texture lookup. s is a sampler. t is a 4D vector. The mip level is biased by t.w before the lookup takes place.
texCUBE(s, t)             Cube map lookup. s is a sampler. t is a 3D vector.
texCUBE(s, t, ddx, ddy)   Cube map lookup, with derivatives. s is a sampler. t, ddx, and ddy are 3D vectors.
texCUBEproj(s, t)         Projective cube map lookup. s is a sampler. t is a 4D vector. t is divided by its last component before the lookup takes place.
texCUBEbias(s, t)         Biased cube map lookup. s is a sampler. t is a 4D vector. The mip level is biased by t.w before the lookup takes place.
The tex1D(), tex2D(), tex3D(), and texCUBE() intrinsics are the most commonly used to sample textures. The texture loading intrinsics that take ddx and ddy parameters compute texture LOD using these explicit derivatives, which would typically have been previously calculated with the ddx() and ddy() math intrinsics. These are particularly important when writing procedural pixel shaders, but they are not supported on ps_2_0 or lower compile targets. The tex*proj() intrinsics are used to do projective texture reads, where the texture coordinates used to sample the texture are divided by the last component prior to accessing the texture. Of these, tex2Dproj() is the most commonly used, since it is necessary for projective shadow maps and similar effects. The tex*bias() intrinsics are used to perform biased texture sampling, where the bias can be computed per pixel. This is typically done to induce some over-blurring of the texture for a special effect. For example, as discussed in ShaderX2: Shader Programming Tips & Tricks with DirectX 9, the pixel shader used on the motion-blurred balls in the Radeon 9700 Animusic Pipe Dream demo uses the texCUBEbias() intrinsic to access the cubic environment map of the local scene:
...
// Blur reflection by extension amount.
float3 vCubeLookup = vReflection + i.Pos/fEnvMapRadius;
float4 cReflection = texCUBEbias(tCubeEnv,
                         float4(vCubeLookup, fBlur * fTextureBlur)) * vReflectionColor;
...
In this code snippet, fBlur * fTextureBlur is stored in the fourth component of the texture coordinate used in the texCUBEbias() call and determines the bias to be used when accessing the cube map.
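The projective and biased variants can be compared side by side in a small sketch. The sampler ProjectedSampler, the uniform fMipBias, and the function name are hypothetical. The manual divide is shown only to illustrate what tex2Dproj() does with the last component of its coordinate.

```hlsl
sampler ProjectedSampler;   // hypothetical sampler
float   fMipBias;           // hypothetical uniform: mip level bias

float4 proj_bias_example (float4 vProjCoord : TEXCOORD0) : COLOR
{
    // Projective read: the coordinate is divided by vProjCoord.w before the lookup
    float4 cProj = tex2Dproj (ProjectedSampler, vProjCoord);

    // The same divide done by hand, followed by an ordinary 2D read
    float2 uv = vProjCoord.xy / vProjCoord.w;
    float4 cManual = tex2D (ProjectedSampler, uv);

    // Biased read: the fourth component of the coordinate carries the bias
    float4 cBlurred = tex2Dbias (ProjectedSampler, float4 (uv, 0.0f, fMipBias));

    return (cProj + cManual + cBlurred) / 3.0f;
}
```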
Now that we have introduced some of the mechanics of the language, we can discuss how data is input to and output from HLSL shaders in DirectX 9.
Shader Inputs
Vertex and pixel shaders have two types of input data: varying and uniform. The varying input is the data that is unique to each execution of a shader. For a vertex shader, the varying data (e.g., position, normals, etc.) comes from the vertex streams. The uniform data (e.g., material color, world transform, etc.) is constant for multiple executions of a shader. If you are familiar with the assembly models, uniform data is specified in constant registers and varying data in the v and t registers in vertex and pixel shaders.
Uniform Input
Uniform data can be specified by two methods in HLSL. The most common method is to declare global variables and use them within the vertex or pixel shaders. Any use of a global variable within a shader will result in the addition of the variable to a list of uniform variables required by the shader. The second method is to mark an input parameter of the top-level shader function as uniform. This marking specifies that the given variable should be added to the list of uniform variables used by the shader. Both of these methods are illustrated in the following code snippet:
// Declare a global uniform variable
// Appears in constant table under name 'UniformGlobal'
float4 UniformGlobal;

// Declare a uniform input parameter
// Appears in constant table under name '$UniformParam'
float4 main( uniform float4 UniformParam ) : POSITION
{
    return UniformGlobal * UniformParam;
}
The uniform variables used by a shader are communicated back to the application via the constant table. The constant table is a symbol table that defines how the uniform variables used by a shader must be loaded into the constant registers prior to shader execution.
NOTE The uniform input function parameters appear in the constant table with a $ prepended, unlike the global variables. The $ is required to avoid name collisions between local uniform inputs and global variables of the same name.
The constant table contains the constant register locations of all uniform variables used by the shader. The table also includes the type information and the default value, if specified, for each constant table entry. The following is an example of what a constant table looks like when printed out. The constant table generated by the compiler is stored in a compact binary form. The API to interpret the table at run time will be discussed later in the section on HLSL integration without the use of D3DX Effects. Here is the textual printout of a constant table emitted by fxc.exe for a sample shader:
//
// Generated by Microsoft (R) D3DX9 Shader Compiler
//
// Source: hemisphere.fx
// Flags: /E:VS /T:vs_1_1
//
// Registers:
//
// Name
// ------------
// Projection
// WorldView
// DirFromLight
// DirFromSky
// $bHemi
// $bDiff
// $bSpec
//
Varying Input
Varying data is specified by marking the input parameters of the top-level shader function with an input semantic. All top-level shader inputs must either be marked as varying by using semantics or marked with the keyword uniform to indicate the value is constant for the execution of the shader. If a top-level shader input is not marked with a semantic or the uniform keyword, the shader will fail to compile.

The input semantic is a name used to link the given shader input to an output of the previous stage of the graphics pipeline. For example, the input semantic POSITION0 is used by vertex shaders to specify where the position data from the vertex buffer should be linked. Pixel and vertex shaders have different sets of input semantics due to the different parts of the graphics pipeline that feed into each shader unit.

Vertex shader input semantics describe the per-vertex information to be loaded from a vertex buffer into a form that can be consumed by the vertex shader (e.g., positions, normals, texture coordinates, colors, tangents, binormals, etc.). These input semantics directly map to the combination of the D3DDECLUSAGE enum and UsageIndex that is used to describe vertex data elements in a vertex buffer.

Pixel shader input semantics describe the information that is provided per pixel by the rasterization unit. This data is generated by interpolating between the outputs of the vertex shader for each vertex of the current primitive. The basic pixel shader input semantics link the input color and texture coordinate information to input parameters.
Input semantics can be assigned to shader input by two methods. The first method is by appending a colon (:) and the input semantic name to the input parameter declaration. The second method is to define an input structure with input semantics assigned to each element of the input structure. Both of these styles are used in the example shaders in this chapter and throughout the ShaderX books. Here is an input semantic example:
// Declare an input structure with a semantic binding
struct InStruct
{
    float4 Pos1 : POSITION1;
};

// Declare the Pos variable as containing position data
float4 main( float4 Pos : POSITION0, InStruct In ) : POSITION
{
    return Pos * In.Pos1;
}

// Declare the Col variable as containing the interpolated COLOR0 value
float4 mainPS( float4 Col : COLOR0 ) : COLOR
{
    return Col;
}
Shader Outputs
Vertex and pixel shaders provide output data to the subsequent graphics pipeline stage. Output semantics are used to specify how data generated by the shader should be linked to the inputs of the next stage. For example, the output semantics for a vertex shader are used to link the outputs with the interpolators in the rasterizer to generate the input data for the pixel shader. The pixel shader outputs are the values provided to the alpha blending unit for each of the render targets or the depth value to be written to the depth buffer. Vertex shader output semantics are used to link the shader to both the pixel shader and the rasterizer stage. The POSITION output is a required output from each vertex shader that is consumed by the rasterizer and not exposed to the pixel shader. TEXCOORDn and COLORn denote outputs that are made available to the pixel shader post interpolation. Pixel shader output semantics bind the output colors of a pixel shader with the correct render target. The colors output from the pixel shader are linked to the alpha blend stage, which determines how the destination render targets are modified. The DEPTH output semantics can be used to change the destination depth value at the current raster location.
NOTE DEPTH and multiple render targets (also known as MRT) are only supported with some shader models.
The syntax for output semantics is identical to the syntax for specifying input semantics. The semantics can either be specified directly on parameters declared as out parameters or assigned during the definition of a structure that is returned either as an out parameter or as the return value of the function. Here are the vertex shader output semantics:
Semantic     Description
POSITION     Position
PSIZE        Point size
FOG          Vertex fog
COLORn       Color (example: COLOR0)
TEXCOORDn    Texture coordinates (example: TEXCOORD0)
n is an optional integer (as an example: TEXCOORD3, COLOR0). The following code snippets illustrate the variety of ways in which data can be output from HLSL shaders:
// Declare an output structure with a semantic binding
struct OutStruct
{
    float2 Tex2 : TEXCOORD2;
};

// Declare the Tex0 out parameter as containing TEXCOORD0 data
float4 main( out float2 Tex0 : TEXCOORD0, out OutStruct Out ) : POSITION
{
    Tex0 = float2(1.0, 0.0);
    Out.Tex2 = float2(0.1, 0.2);
    return float4(0.5, 0.5, 0.5, 1);
}

// Declare the Col1 out parameter as containing COLOR1 data
float4 mainPS( out float4 Col1 : COLOR1 ) : COLOR
{
    // write out to render target 1 using out parameter
    Col1 = float4(0.0, 0.0, 0.0, 0.0);

    // write to render target 0 using the declared return destination
    return float4(1.0, 0.9722, 0.3333334, 0);
}
struct PS_OUT
{
    float4 Color: COLOR;
    float Depth: DEPTH;
};

//
// Three different ways to output from a pixel shader:
//
PS_OUT PSFunc1() { ... }

void PSFunc2(out float4 Color : COLOR,
             out float Depth : DEPTH)
{
    ...
}

void PSFunc3(out PS_OUT Out) { ... }
An Example Shader
Now that we've discussed the language itself and how it connects with the rest of the graphics pipeline via inputs and outputs, we can discuss an example shader called NPR Metallic. We call it this because it was designed to look like a metallic surface that would exist in a world rendered in a cel-animation style (see Figure 2). This effect ships with the RenderMonkey shader development environment discussed in the "Shader Development Using RenderMonkey" article in this book and is available on the ATI Developer Relations web site (www.ati.com/developer).
First, let's look at the NPR Metallic vertex shader written in HLSL:
float4x4 view_proj_matrix;
float4 view_position;
float4 light0;
float4 light1;
float4 light2;

struct VS_OUTPUT
{
    float4 Pos   : POSITION;
    float3 View  : TEXCOORD0;
    float3 Normal: TEXCOORD1;
    float3 Light1: TEXCOORD2;
    float3 Light2: TEXCOORD3;
    float3 Light3: TEXCOORD4;
};

VS_OUTPUT main( float4 inPos  : POSITION,
                float3 inNorm : NORMAL )
{
    VS_OUTPUT Out = (VS_OUTPUT) 0;

    // Output transformed vertex position:
    Out.Pos = mul(view_proj_matrix, inPos);

    Out.Normal = inNorm;

    // Compute the view vector:
    Out.View = normalize(view_position - inPos);

    // Compute vectors to three lights from the current vertex position:
    Out.Light1 = normalize(light0 - inPos); // Light 1
    Out.Light2 = normalize(light1 - inPos); // Light 2
    Out.Light3 = normalize(light2 - inPos); // Light 3

    return Out;
}
The first thing that we see in this vertex shader is the declaration of a matrix and a set of float4 vectors at global scope: view_proj_matrix, view_position, light0, light1, and light2. These are all implicitly uniform variables that are externally settable by the API and modifiable in the shader itself. Following these global variables, we see the definition of a structure called VS_OUTPUT, which is also the return type of our main function. This means that this vertex shader will output five 3D texture coordinates in addition to the required 4D position. Looking at the main function, we can see that the vertex shader takes a 4D vector as input position and a 3D vector as input normal. The input position, inPos, is transformed by the view_proj_matrix using the mul() intrinsic, while the normal, inNorm, is passed through to the output untouched. Finally, 3D vectors from the object space vertex position to the three lights and the view position are all computed. These 3D vectors are passed to the normalize() intrinsic to guarantee that they are of unit length. These normalized 3D vectors are all output from the vertex shader as 3D texture coordinates that will be interpolated across the polygon. To reinforce the earlier discussion about compile targets and assembly models, let's compile this shader and have a look at the assembly output. First, we write the above code into a file called
Because this vertex shader does not require flow control, we select the vs_1_1 compile target. We also set the flags to generate a code file and disable validation. A portion of the generated code file is shown here:
// Parameters:
//
//   float4 light0;
//   float4 light1;
//   float4 light2;
//   float4 view_position;
//   float4x4 view_proj_matrix;
//
// Registers:
//
//   Name             Reg   Size
//   ---------------- ----- ----
//   view_proj_matrix c0       4
//   view_position    c4       1
//   light1           c5       1
//   light2           c6       1
//   light0           c7       1

    vs_1_1
    dcl_position v0
    dcl_normal v1
    mul r0, v0.x, c0
    mad r2, v0.y, c1, r0
    mad r4, v0.z, c2, r2
    mad oPos, v0.w, c3, r4
    add r1, -v0, c4
    dp4 r1.w, r1, r1
    rsq r1.w, r1.w
    mul oT0.xyz, r1, r1.w
    add r8, -v0, c7
    dp4 r8.w, r8, r8
    rsq r8.w, r8.w
    mul oT2.xyz, r8, r8.w
    add r3, -v0, c5
    add r10, -v0, c6
    dp4 r3.w, r3, r3
    rsq r3.w, r3.w
    mul oT3.xyz, r3, r3.w
    dp4 r10.w, r10, r10
    rsq r10.w, r10.w
    mul oT4.xyz, r10, r10.w
    mov oT1.xyz, v1
At the top of the code file, we see the parameters to this vertex shader. That is, we see the global scope variables that will need to be set from the API for this shader to work properly in a given application. The next section shows the hardware registers to which these parameters must be loaded by the application for the assembly shader to work properly. Next, we have the shader code itself, which was compiled to 21 assembly instructions. We don't go through all of the code, but you should take note of the dcl_position and dcl_normal statements, which are a direct result of the POSITION and NORMAL semantics on the inputs to the shader's main function. Additionally, note the storage of final results in the oPos, oT0, oT1, oT2, oT3, and oT4 registers. This is caused by the return type of the function being a structure whose members are tagged with the corresponding semantics. While not strictly necessary, knowing how to use fxc to generate assembly code from HLSL and how to read through it can be beneficial at some stages of development, particularly when trying to write more optimal HLSL. Now that we have used the vertex shader to transform the geometry into clip space and define the values that will be interpolated across the polygons, we can move on to the pixel shader, which will make use of all of these interpolated quantities. The following is the NPR Metallic pixel shader:
float4 Material;
sampler Outline;

float4 main( float3 View:   TEXCOORD0,
             float3 Normal: TEXCOORD1,
             float3 Light1: TEXCOORD2,
             float3 Light2: TEXCOORD3,
             float3 Light3: TEXCOORD4 ) : COLOR
{
    // Normalize input normal vector:
    float3 norm = normalize (Normal);

    float4 outline = tex1D(Outline, 1 - dot (norm, normalize(View)));

    float lighting = (dot (normalize (Light1), norm) * 0.5 + 0.5) +
                     (dot (normalize (Light2), norm) * 0.5 + 0.5) +
                     (dot (normalize (Light3), norm) * 0.5 + 0.5);

    return outline * Material * lighting;
}
As before, we see that this shader has declared some variables at global scope. In this case, we have a 4D vector Material, which defines material values for the object to be rendered, and a single sampler Outline, which we use to access a special texture used for outlining the object. The five 3D texture coordinates computed in the vertex shader are the inputs to the main function of this pixel shader and define the view vector, the normal vector, and three light vectors. Since the texture coordinates are linearly interpolated across the polygon, it is possible for them to contain non-normalized values at a given pixel. Thus, this shader first renormalizes the interpolated normal vector using the normalize() intrinsic. Subsequently, the outline texture is sampled using the dot product of the normalized normal and view vectors. The lighting is then computed by summing a series of scaled and biased dot products of the normal with normalized light vectors. In the last line of this pixel shader, we return the product of the variables outline, Material, and lighting. The first two of these are 4D vectors, while the last is a scalar. If you recall from our earlier discussion of type casting, the multiplication of the scalar by a vector temporarily promotes the scalar to a vector whose components are all equivalent to the original scalar. That is, the following two expressions are equivalent:
return outline * Material * lighting;
return outline * Material * float4(lighting, lighting, lighting, lighting);
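The lighting sum and the scalar promotion can also be checked on the CPU. The following C++ sketch is only a reference model of the pixel shader arithmetic, not code from the shader itself; the type aliases and the function names dot3, normalize3, nprLighting, and scale4 are hypothetical names introduced for illustration.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<float, 3>;
using Vec4 = std::array<float, 4>;

// CPU counterpart of the HLSL dot() intrinsic for 3D vectors.
inline float dot3(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// CPU counterpart of normalize(); assumes v is not the zero vector.
inline Vec3 normalize3(const Vec3& v) {
    const float invLen = 1.0f / std::sqrt(dot3(v, v));
    return {v[0] * invLen, v[1] * invLen, v[2] * invLen};
}

// Sum of three scaled-and-biased N.L terms, mirroring the pixel shader.
// Each term lies in [0, 1], so the sum lies in [0, 3].
inline float nprLighting(const Vec3& normal, const Vec3& l1,
                         const Vec3& l2, const Vec3& l3) {
    const Vec3 n = normalize3(normal);
    return (dot3(normalize3(l1), n) * 0.5f + 0.5f) +
           (dot3(normalize3(l2), n) * 0.5f + 0.5f) +
           (dot3(normalize3(l3), n) * 0.5f + 0.5f);
}

// Scalar promotion: multiplying a 4D vector by a scalar scales every channel,
// exactly as if the scalar had been replicated into a 4D vector first.
inline Vec4 scale4(const Vec4& v, float s) {
    return {v[0] * s, v[1] * s, v[2] * s, v[3] * s};
}
```

When all three light vectors point along the normal, each term evaluates to 1 and the sum reaches its maximum of 3; when they all point directly away, the sum is 0.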
Thus, the end result is that all of the channels are multiplied by the scalar lighting, giving us the final result you see in Figure 2. As we did with the NPR Metallic vertex shader, we generate a code file for the pixel shader using fxc:
fxc -nologo -T ps_2_0 -Fc -Vd NPRMetallic.phl
This compilation uses the same flags as before but is compiled for the ps_2_0 target. The resulting 29-instruction shader is shown below:
// Parameters:
//
//   float4 Material;
//   sampler Outline;
//
// Registers:
//
//   Name         Reg   Size
//   ------------ ----- ----
//   Material     c0       1
//   Outline      s0       1

    ps_2_0
    def c1, 1, 0, 0, 0.5
    dcl t0.xyz
    dcl t1.xyz
    dcl t2.xyz
    dcl t3.xyz
    dcl t4.xyz
    dcl_2d s0
    dp3 r0.w, t1, t1
    rsq r2.w, r0.w
    mul r9.xyz, r2.w, t1
    dp3 r9.w, t0, t0
    rsq r9.w, r9.w
    mul r4.xyz, r9.w, t0
    dp3 r9.w, r9, r4
    add r11.xy, -r9.w, c1.x
    texld r6, r11, s0
    dp3 r9.w, t2, t2
    rsq r9.w, r9.w
    mul r1.xyz, r9.w, t2
    dp3 r9.w, r1, r9
    mad r9.w, r9.w, c1.w, c1.w
    dp3 r8.w, t3, t3
    rsq r10.w, r8.w
    mul r5.xyz, r10.w, t3
    dp3 r0.w, r5, r9
    mad r9.w, r0.w, c1.w, r9.w
    add r9.w, r9.w, c1.w
    dp3 r2.w, t4, t4
    rsq r11.w, r2.w
    mul r1.xyz, r11.w, t4
    dp3 r8.w, r1, r9
    mad r10.w, r8.w, c1.w, r9.w
    add r5.w, r10.w, c1.w
    mul r6, r6, r5.w
    mul r0, r6, c0
    mov oC0, r0
As before, the variables (in this case, the constant Material and the sampler Outline) are listed at the top of the file. These must be set properly by the application via the API in order for the shader to function correctly. After the ps_2_0 instruction, there is a def instruction of some magic constants. This def instruction is a free instruction that appears in the actual assembly instruction stream that defines constants that will be used by the subsequent ALU operations. This kind of constant definition is generally the result of literal values appearing in the HLSL shader, as in the following statements taken from the NPR Metallic pixel shader:
... 1 - dot (norm, normalize(View)) ...
... dot (normalize (Light1), norm) * 0.5 + 0.5 ...
Following this constant definition, there are five 3D texture coordinate declarations of the form dcl tn.xyz. As in the vertex shader, these are a result of the semantics of the input parameters to this HLSL shader's main function. Following the texture coordinate declarations, there is a sampler declaration dcl_2d s0. This indicates that a 2D texture must be bound to sampler zero. This may seem odd since the tex1D() intrinsic was used in the HLSL shader. This discrepancy exists since there is no such thing as a 1D texture in the Direct3D API or shader assembly language. The tex1D() intrinsic is actually just a way for the HLSL shader writer to indicate to the compiler that only one component of the texture coordinate needs to be populated, shaving off an assembly instruction in some cases. Now that you are familiar with some of the correspondence between HLSL and assembly code, we can discuss optimization strategies so that you can be sure that you are writing the best HLSL possible.
Optimization
While the DirectX HLSL compiler has an excellent optimizer built in, there are things that you can do as an HLSL coder to help shave off a few more cycles here and there. While this is probably more of an academic exercise in the long term, it can make the difference between being able and not being able to target a legacy 1.x shader model using HLSL.

The most important thing to remember about writing high-performance shaders is that the compiler is required to do what you ask it to. That is, if you write your shader to require a certain number of math operations or a particular value in an output component, it needs to perform those operations. The compiler is smart about removing dead code, but it cannot know about values that do not ultimately matter due to circumstances outside of a given shader. For example, if the pixel shader is not using the second texture coordinate, the vertex shader probably shouldn't compute it. The HLSL compiler, of course, has no way of knowing this when you compile the vertex shader. Additionally, you may know that you will always use an n×1 function lookup texture at a given sampler, and hence it is not necessary to compute the second texture coordinate for use in the sampling intrinsic. If you use the tex2D() intrinsic, however, the HLSL compiler requires you to compute the second texture coordinate even though it is ultimately unnecessary. The compiler is designed to build an assembly program that does exactly what you asked without making any visual quality versus performance trade-offs.

Another extremely important objective for high-performance shaders is to make sure that a computation only runs at the required frequency. If you can get away with doing a calculation per vertex rather than per pixel, then do so. The biggest wins often come from these types of operations. The same optimization is true for operations on values that are uniform (i.e., operations that do not change for the entire execution of the shader). An example of this would be pre-multiplying the world ambient color value by an object's material ambient value and passing their product to the shader instead of redundantly calculating the product per vertex or per pixel.

The following sections go into some detail on how language features are mapped into assembly constructs. While it is not necessary to understand how to write vertex or pixel shader assembly, it can be quite helpful to understand the basic limitations and efficiencies of the assembly models. Understanding key assembly features is essential to generating compact and efficient shaders.
row major order, depending on how the matrix is used. This optimization can be quite useful for situations in which a matrix is generated in either a pixel or vertex shader. As mentioned earlier, for input matrices, the compiler always uses either column major or row major storage format based on a compiler flag, with column major being the default method.
The following is code generated with a float index versus an int index:
OutPos = mul(Pos, WorldArray[Index]);

// Index declared as float
frc r0.w, r1.w
add r2.w, -r0.w, r1.w
mul r9.w, r2.w, c61.x
mova a0.x, r9.w
m4x4 oPos, v0, c0[a0.x]

// Index declared as int
mul r0.w, c60.x, r1.w
mova a0.x, r0.w
m4x4 oPos, v0, c0[a0.x]
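The two extra instructions in the float version exist to strip the fractional part of the index before it is scaled into a register offset. The C++ sketch below models that arithmetic on the CPU. The function names are hypothetical, and the stride parameter is an assumption: the listing does not show the values of c60.x and c61.x, but a float4x4 array element occupies four constant registers, so a stride of 4 is a plausible reading.

```cpp
#include <cmath>

// frc: fractional part of x, computed as x - floor(x).
inline float frc(float x) { return x - std::floor(x); }

// Float index path: frc + add remove the fraction, then mul scales the whole
// part by the number of constant registers per array element.
inline int registerOffsetFromFloat(float index, float registersPerElement) {
    const float fraction = frc(index);     // frc r0.w, r1.w
    const float whole = index - fraction;  // add r2.w, -r0.w, r1.w
    return static_cast<int>(whole * registersPerElement);  // mul, then mova
}

// Int index path: the value is trusted to be whole, so only the mul remains.
inline int registerOffsetFromInt(int index, int registersPerElement) {
    return index * registersPerElement;  // mul, then mova
}
```

Both paths yield the same offset for whole-number input; the int declaration simply lets the compiler skip the floor computation.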
shader models. In models that do not support any form of branching, both sides of an if must be executed and the output chosen based on which side of the if would have been taken. For shader writers coming from the CPU programming world, this form of execution is a bit different from what they would expect. Common optimizations that you would use on a CPU to avoid expensive operations will not work as expected on shader models that don't support branches, since both the expensive path and the cheap path will be executed. Some shader models support different levels of branching: predicated instructions, static if blocks, and dynamic if blocks. Example using if in vs_1_1:
if (Value > 0)
    Position = Value1;
else
    Position = Value2;
Assembly generated:
// calculate lerp value based on Value > 0
mov r1.w, c2.x
slt r0.w, c3.x, r1.w

// lerp between Value1 and Value2
mov r7, -c1
add r2, r7, c0
mad oPos, r0.w, r2, c1
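The generated assembly can be read as a branch-free select. The C++ sketch below reproduces it with scalars for brevity (the shader operates on 4D vectors). The register roles are inferred from the listing and are assumptions: c0 is taken to hold Value1, c1 Value2, c2.x Value, and c3.x the literal 0. The function names are hypothetical.

```cpp
// slt: set-on-less-than, 1.0 when a < b and 0.0 otherwise.
inline float slt(float a, float b) { return a < b ? 1.0f : 0.0f; }

// Branch-free equivalent of:
//   if (value > 0) result = value1; else result = value2;
// Both candidate values are always computed; the comparison only picks one.
inline float selectNoBranch(float value, float value1, float value2) {
    const float t = slt(0.0f, value);     // slt r0.w, c3.x, r1.w  (0 < value)
    const float delta = value1 - value2;  // mov r7, -c1 ; add r2, r7, c0
    return t * delta + value2;            // mad oPos, r0.w, r2, c1
}
```

Because t is exactly 0 or 1, the mad returns value2 or value1 with no blending in between.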
The most common branching support in current hardware shading models is static branching. Static branching is a capability in a shader model that allows for blocks of code to be switched on or off based on a Boolean shader constant. This is a very convenient method for enabling/disabling potentially expensive code paths based on the type of object currently being rendered. Between draw calls, you can decide the various features that you want to support with the current shader and then set the Boolean flags required to get that behavior. The best part about this method is that any instructions that are disabled by the Boolean constant are completely skipped during execution. The disadvantage is that you can only change the if blocks that are enabled/disabled at a low frequency (i.e., between draw calls). In contrast, using the execute-both-sides approach, it is possible to choose between the outputs of the two paths dynamically at a per-pixel or per-vertex level.

The most familiar branching support is dynamic branching. The dynamic branching support offered by some shader models is very similar to that offered by a standard CPU. The performance hit is the cost of the branch plus the cost of the instructions on the side of the branch taken. This execution cost is comparable to what most people are familiar with optimizing for in CPU-side code. The problem with this form of branching is that it is not available on most hardware and is currently only available for vertex shaders. Optimizing shaders that work with these models is very similar to optimizing code that runs on a CPU.
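The three strategies can be compared with a deliberately simplified cost model. The C++ sketch below only restates the rules of thumb from the text; the instruction counts are hypothetical inputs rather than measurements, the names are invented for illustration, and real hardware may add overheads that this model ignores.

```cpp
// Hypothetical instruction counts for the two sides of an if statement.
struct IfBlock {
    int thenCost;
    int elseCost;
};

// No branching support: both sides run, then a select picks the result.
inline int costExecuteBoth(const IfBlock& block, int selectCost) {
    return block.thenCost + block.elseCost + selectCost;
}

// Static branching: the Boolean constant enables one side and the other side
// is skipped entirely. The flag can only change between draw calls.
inline int costStatic(const IfBlock& block, bool flag) {
    return flag ? block.thenCost : block.elseCost;
}

// Dynamic branching: the cost of the branch plus the side actually taken.
inline int costDynamic(const IfBlock& block, bool taken, int branchCost) {
    return branchCost + (taken ? block.thenCost : block.elseCost);
}
```

With a 20-instruction expensive path and a 2-instruction cheap path, the execute-both-sides model pays for both paths on every invocation, while the static and dynamic models pay the cheap cost whenever the cheap side is selected.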
If the parameter were declared as a float4, then the w component would be set to 1.0f by the hardware loading the input registers. The compiler cannot do this type of optimization automatically, since this optimization requires knowledge of what data is in the vertex buffer. Another optimization is to make sure to declare all input parameters with the appropriate type for their usage in the shader. For example, if the incoming data is integer and the data is going to be used for addressing purposes, then it is important to declare the parameter as an int to avoid truncation. The subtle issue with declaring inputs as ints is that the values in the input should truly be integer values. Otherwise, the generated code might not run correctly due to the optimizations that the compiler will make based upon the assumption that the input data is truly integer data.
compiler that the operation should be performed with the lowest precision possible. Some pixel shader hardware can take advantage of performing other operations at a lower precision as well. Here is an example of log versus logp:
float LogValue = log(Value);          // counts as 10 instructions on vs_1_1
    log r0, c0

float LogValue = (half)log(Value);    // counts as 1 instruction on vs_1_1
    logp r0, c0
have is the existence of free source and dest modifiers (i.e., the ability to clamp values to the 0 to 1 range, take the complement of a source, negate a source, bias a source, etc.). These modifiers are extremely handy when generating shaders that accomplish a lot in a small number of instructions. The compiler automatically matches all modifiers that it can, but it is helpful if the HLSL shader writer thinks in terms of using these modifiers to accomplish certain operations. In fact, some intrinsics were added to HLSL to make this type of shader writing easier. For example, it is recommended that you use the saturate() intrinsic when trying to generate a free _sat modifier in a pixel shader. We now present a series of HLSL code sequences that generate free source modifiers when compiling to ps_1_x targets.
It is important to note that the Tex*2 - 1 version is recommended because it generates more optimal code in ps_2_0 targets and beyond.
Note that _bias cannot be done in ps_1_1, ps_1_2, or ps_1_3 unless the source is known to be in the range of 0 to 1. That is, it must have been previously saturated.
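For reference, the arithmetic behind the modifiers discussed in this section can be written out as plain functions. The C++ sketch below is a CPU-side model only; the function names are hypothetical, and the mapping in the comments reflects the usual ps_1_x definitions (_bias subtracts 0.5, _bx2 computes 2x - 1) rather than anything the compiler is guaranteed to emit for a given expression.

```cpp
#include <algorithm>

// _sat instruction modifier: clamp to [0, 1], as the saturate() intrinsic does.
inline float saturate(float x) { return std::min(std::max(x, 0.0f), 1.0f); }

// Complement source modifier: 1 - x.
inline float complement(float x) { return 1.0f - x; }

// Negate source modifier: -x.
inline float negate(float x) { return -x; }

// _bias source modifier: x - 0.5.
inline float bias(float x) { return x - 0.5f; }

// _bx2 source modifier: 2 * (x - 0.5), which equals x * 2 - 1.
inline float bx2(float x) { return 2.0f * x - 1.0f; }
```

An input already saturated to the 0 to 1 range maps to -0.5 to 0.5 under bias and to -1 to 1 under bx2, which is why the latter is the usual way to unpack a signed vector stored in a texture.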
The _x2, _x4, _x8, _d2, _d4, and _d8 Destination Write Modifiers
A set of destination write modifiers exists in the ps_1_x models, and it is possible to write HLSL code to cause the compiler to generate them in the resulting asm. The modifiers to double (_x2), quadruple (_x4), and halve (_d2) the result of the instruction are supported on ps_1_1 through ps_1_3 models, while the ps_1_4 model supports all six of the modifiers _x2, _x4, _x8, _d2, _d4, and _d8. The following code will generate the corresponding modifiers for N = 2, 4, 8, 0.5, 0.25, or 0.125:
static const float N = 2;

float4 main( float4 Col[2] : COLOR0 ) : COLOR0
{
    return (Col[0] + Col[1]) * N;
}
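The relationship between the scale factor N, the modifier it produces, and the targets that support it can be summarized as a small lookup. The C++ sketch below only encodes the statements made in the paragraph above; the function names and the use of a minor version number to identify the ps_1_x target are hypothetical conveniences.

```cpp
#include <string>

// Returns the destination write modifier associated with scale factor n,
// or an empty string when n is not one of the six listed values.
inline std::string destModifierFor(float n) {
    if (n == 2.0f)   return "_x2";
    if (n == 4.0f)   return "_x4";
    if (n == 8.0f)   return "_x8";
    if (n == 0.5f)   return "_d2";
    if (n == 0.25f)  return "_d4";
    if (n == 0.125f) return "_d8";
    return "";
}

// True when ps_1_<minor> supports the modifier according to the text:
// ps_1_1 through ps_1_3 support _x2, _x4, and _d2; ps_1_4 supports all six.
inline bool supportedOn(const std::string& modifier, int ps1MinorVersion) {
    if (modifier.empty()) return false;
    if (ps1MinorVersion == 4) return true;
    if (ps1MinorVersion >= 1 && ps1MinorVersion <= 3) {
        return modifier == "_x2" || modifier == "_x4" || modifier == "_d2";
    }
    return false;
}
```

For example, N = 0.125 corresponds to _d8, which this table reports as available on ps_1_4 but not on ps_1_1.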
support or no shaders at all. An excellent example of this kind of use of techniques is the Water sample in the DirectX SDK. This sample uses several different techniques that are targeted at different generations of hardware. Of course, the more basic techniques that require fewer textures and generally less sophistication don't look as impressive, but that's the point; D3DX Effects let you manage this quality/speed trade-off very naturally.
Effect Files
We don't go into all of the facets of effects here, but you should understand the basic structure of an effect file in order to see how it can be used with HLSL. A typical effect file might look something like this:
// Lighting constants
VECTOR g_Leye;
float4 GlobalAmbient = 0.5;
float Ka = 1;
float Kd = 0.8;
float Ks = 0.9;
float roughness = 0.1;
float noiseFrequency;

MATRIX matWorldViewProj;
MATRIX matWorldView;
MATRIX matITWorldView;
MATRIX matWorld;
MATRIX matTex0;

TEXTURE tVolumeNoise;
TEXTURE tMarbleSpline;

sampler NoiseSampler = sampler_state
{
    Texture = (tVolumeNoise);
    MinFilter = Linear;
    MagFilter = Linear;
    MipFilter = Linear;
    AddressU = Wrap;
    AddressV = Wrap;
    AddressW = Wrap;
    MaxAnisotropy = 16;
};

sampler MarbleSplineSampler = sampler_state
{
    Texture = (tMarbleSpline);
    MinFilter = Linear;
    MagFilter = Linear;
    MipFilter = Linear;
    AddressU = Clamp;
    AddressV = Clamp;
    MaxAnisotropy = 16;
};

float3 snoise (float3 x)
{
    return 2.0f * tex3D (NoiseSampler, x) - 1.0f;
}

float4 soft_diffuse(float3 Neye, float3 Peye)
{
    // Compute normalized vector from vertex to light in eye space (Leye)
    float3 Leye = (g_Leye - Peye) / length(g_Leye - Peye);

    float NdotL = dot(Neye, Leye) * 0.5f + 0.5f; // N.L

    return float4(NdotL, NdotL, NdotL, NdotL);
}

    // Compute normalized vector from vertex to light in eye space (Leye)
    float3 Leye = (g_Leye - Peye) / length(g_Leye - Peye);

    // Compute Veye
    float3 Veye = -(Peye / length(Peye));

    // Compute half-angle
    float3 Heye = (Leye + Veye) / length(Leye + Veye);

    // Compute N.H
    float NdotH = clamp(dot(NNeye, Heye), 0.0f, 1.0f);

    float NdotH_2  = NdotH    * NdotH;
    float NdotH_4  = NdotH_2  * NdotH_2;
    float NdotH_8  = NdotH_4  * NdotH_4;
    float NdotH_16 = NdotH_8  * NdotH_8;
    float NdotH_32 = NdotH_16 * NdotH_16;

    return NdotH_32 * NdotH_32;
}

float4 hlsl_bluemarble (float3 P : TEXCOORD0,
                        float3 Peye : TEXCOORD1,
                        float3 Neye : TEXCOORD2) : COLOR
{
    float4 Ct;
    float4 Ci;
    float3 NNeye;
    float marble;
    float f;

    // Divide down to nice frequency
    P = P/16;

    marble = -2.0f * snoise(P * noiseFrequency) + 0.75f;

    NNeye = normalize(Neye);

    // Cubic interpolation of f along color spline (gloss in alpha)
    Ct = tex1D (MarbleSplineSampler, marble);

    // Color from illumination
    Ci = Ct * (Ka * ambient() + Kd * soft_diffuse(NNeye, Peye)) +
         Ct.w * Ks * specular(NNeye, Peye, roughness);

    return Ci;
}

VERTEXSHADER asm_marble_vs =
decl {}
asm
{
    vs.1.1
    dcl_position v0
    dcl_normal v3
    m4x4 oPos, v0, c[0]     // Transform position to clip space
    m4x4 r0, v0, c[17]      // Transformed Pshade (using texture matrix 0)
    mov oT0, r0
    m4x4 oT1, v0, c[4]
    m3x3 oT2.xyz, v3, c[8]
};

technique technique_hlsl_bluemarble
{
    pass P0
    {
        // Only need to map variable names to hardware
        // registers like this for asm shaders:
        VertexShaderConstant[0] = <matWorldViewProj>;
        VertexShaderConstant[4] = <matWorldView>;
        VertexShaderConstant[8] = <matITWorldView>;
        VertexShaderConstant[12] = <matWorld>;
        VertexShaderConstant[17] = <matTex0>;

        VertexShader = <asm_marble_vs>;
        PixelShader = compile ps_2_0 hlsl_bluemarble();

        CullMode = CCW;
    }
}
We now explain this example effect file from the bottom up. The very last block of code in this effect file defines a technique called technique_hlsl_bluemarble, which has only one rendering pass.
This single pass will use a vertex shader written in assembly language and a pixel shader written in HLSL. The first several lines in this pass declare five different matrices, which will be loaded into specific hardware constant registers from high-level effect variables when this pass is invoked. This explicit mapping is only done in the effect file for asm shaders. There are no explicit mappings done like this for the pixel shader, since it is written in HLSL. The next line declares the vertex shader to be used in this pass, an assembly shader called asm_marble_vs:
VertexShader = <asm_marble_vs>;
The following line defines the pixel shader, which will be compiled for the ps_2_0 target using the hlsl_bluemarble() function as its main entrypoint:
PixelShader = compile ps_2_0 hlsl_bluemarble();
The block of code preceding the technique definition is the vertex shader written by hand in assembly language. Above this is hlsl_bluemarble, the main entrypoint for our HLSL pixel shader. If you take a look at this code, you can see that, in addition to the tex1D() intrinsic, this function calls several other utility functions, such as ambient() and soft_diffuse(). These utility functions are defined earlier in this effect, and since we're compiling for the ps_2_0 target, they are inlined into the resulting assembly. If you look above the utility functions, you can see the declaration of a pair of samplers called NoiseSampler and MarbleSplineSampler. These are declared just as before except that when used in an effect file, they can also be followed by the bracketed code defining the addressing and filtering sampler state to be used. Textures may also be defined in effect files, as shown above the sampler declarations. At the very top of the effect, we see the declaration of a series of global variables, which are settable from the application level.
With all of the proper constants set up, we can set the desired technique and render all of its passes (in this case, just one):
m_pEffect->SetTechnique(
    m_pEffect->GetTechniqueByName("technique_hlsl_bluemarble"));

m_pEffect->Begin(&cPasses, 0);

for (iPass = 0; iPass < cPasses; iPass++)
{
    m_pEffect->Pass(iPass);

    // Render geometry
}

m_pEffect->End();
As you can see, this is a straightforward process that hides several unnecessary burdens from the application. For example, the application never needs to know into what hardware constant register to load g_Leye or to which sampler the noise texture should be bound. These details are all managed by the D3DX Effects framework.
Notice in the above code that the D3DXCompileShader*() routines have some additional parameters not found in the D3DXAssembleShader*() routines. Specifically, it is necessary to specify the name of the main entrypoint for the shader as well as the compile target ("main" and "vs_1_1" above). You can also optionally specify values of #defines, include files, and flags to control generation of debug information, optimization, validation, and matrix data ordering. All of these inputs are passed to the D3DXCompileShader*() routines via the first six parameters. The last three parameters are pointers to buffers that get filled in by the compiler: the binary assembly code, human-readable error messages (optional), and the constant table. The binary assembly code gets passed to CreatePixelShader() or CreateVertexShader(), while the constant table must be used by the application to know how to load the proper constant data prior to executing a given HLSL shader. We devote the remainder of this discussion to the final parameter returned from the D3DXCompileShader*() routine, as this is the most critical piece to understand when integrating HLSL shaders into an application without the use of effects. You can refer to the documentation for discussion of the other parameters.
D3DXHANDLE handle;

if (handle = m_PS_ConstantTable->GetConstantByName(NULL, "ringFreq"))
{
    m_PS_ConstantTable->SetFloat(m_pd3dDevice, handle, m_fRingFrequency);
}

if (handle = m_PS_ConstantTable->GetConstantByName(NULL, "lightWood"))
{
    m_PS_ConstantTable->SetVector(m_pd3dDevice, handle, &lightWood);
}
Likewise, textures and sampler states must be set up correctly, as shown in the following code snippet:
if (handle = m_PS_ConstantTable->GetConstantByName(NULL, "NoiseSampler"))
{
    m_PS_ConstantTable->GetConstantDesc(handle, &constDesc, &count);

    if (constDesc.RegisterSet == D3DXRS_SAMPLER)
    {
        m_pd3dDevice->SetTexture(constDesc.RegisterIndex,
                                 m_pVolumeNoiseTexture);

        // Set sampler states appropriate for the Noise Sampler
        m_pd3dDevice->SetSamplerState(constDesc.RegisterIndex, ..., ...);
    }
}
The implication of this, of course, is that render states, texture stage states, and sampler states must be maintained by the application and are in no way encapsulated in the HLSL shader code as they would be using D3DX Effects. Of course, particularly in any kind of shader-authoring tool, there may be no a priori application knowledge of the names of variables or samplers expected. In this case, it is necessary to use the ID3DXConstantTable::GetDesc() method to determine the number of constants in the constant table. Subsequently, the application can use the ID3DXConstantTable::GetConstantElement() method rather than the ID3DXConstantTable::GetConstantByName() method used in the code snippets above. In general, it is a good idea to familiarize yourself with the ID3DXConstantTable interface if you intend to integrate support for HLSL shaders into your application without the use of D3DX Effects.
SDK Updates
Since the release of DirectX 9.0 and the subsequent DirectX 9.0a patch, Microsoft has committed to releasing periodic SDK updates for developers. These SDK updates do not contain Direct3D run-time changes, but they do include upgrades to important D3DX tools, including the HLSL compiler. It is highly recommended that you keep up to date with the latest release of DirectX SDK updates so that you are using the latest compiler revision and generating the best possible asm from your HLSL source.
Conclusion
We have presented a detailed description of the Direct3D High Level Shading Language (HLSL), which is one of the most significant new features of DirectX 9.0. We have presented an introduction to the mechanics of the language itself and reinforced key concepts with sample shaders. We have also given some insight into the compilation process and how you can best write shaders for optimal performance. We hope this introduction has provided you with a solid foundation so that you can understand the HLSL shaders presented in later chapters and begin integrating HLSL shaders into your own projects.
Acknowledgments
Thanks go to ATI's 3D Application Research Group for providing the sample HLSL shaders. Thanks to Dan Baker and Loren McQuade of Microsoft for their feedback and specifically their contributions to the section on optimizations. Thanks also to Mark Wang and Wolfgang Engel for valuable comments that resulted in greater clarity.
Introduction
DirectX 9 introduces the new shader model 2.0, whose capabilities clearly exceed those of its DirectX 8 counterparts. However, the same DirectX 9 release also includes the 3.0 shading model, whose advanced vertex and pixel processing features open the door to a plethora of new techniques and effects previously not possible in real-time 3D rendering. While the extended shader model 2.x offers some functionality common to its 3.0 counterpart, its availability depends on a number of capabilities that may or may not be exposed, depending on implementations. Vertex and pixel shaders 3.0 raise the bar and require a base feature set for 3D acceleration hardware supporting this model, making it easier to determine the capabilities of the rendering device. They also share the same unified structure and syntax, making the writing of shaders an intuitive and straightforward process. For this reason, a vs_3_0 program must always be associated with a ps_3_0 program and vice versa. This article describes the new features of this shader model in detail while giving practical examples of effects that can be implemented with it.
; Declare outputs
dcl_position o0.xyzw
dcl_texcoord0 o1.xy
dcl_texcoord1 o1.zw
dcl_texcoord2 o2.xyz
dcl_fog o2.w
dcl_psize o3
As the number of vertex shader outputs and pixel shader inputs is the same (12), this feature is not as useful in the pixel shader as it is in the vertex shader. The preferred method of selecting pixel shader inputs is arbitrary source swizzling, which is covered later in this article. Flexible input and output declarations also allow different vertex and pixel shaders to be paired together without having to ensure they all use exactly the same register assignment. This can be a useful feature when dealing with a large number of vertex and pixel shader programs.
Predication
The predicate register (p0) is a set of four Boolean flags (one per x, y, z, and w channel) that is basically a dynamic write mask. It enables shader instructions to be performed on a per-channel basis based on the results of previous calculations. The flags in the predicate register are set with the setp_comp p0, src1, src2 instruction, where comp is a comparison mode (greater than, less than, etc.), p0 is the predicate register, and src1 and src2 are two input registers. The comparison is performed four times on the corresponding components of the source registers, and the results are stored in the Boolean flags of the predicate register. For example, the following code sets the predicate register components to (false, true, false, false):
def c0, 0.0f, 2.0f, -4.0f, 1.0f
def c1, 4.0f, -8.0f, -2.0f, 1.0f
mov r0, c0
setp_gt p0, r0, c1
Once the predicate register is set, its contents can be used to allow or prevent per-channel operations to be carried out. To enable predication, (p0) is added in front of the corresponding arithmetic or texture instruction. For example, based on the predicate register contents defined above, only the .y component of the destination register r0 is affected by the result of the following instruction:
A negate modifier (!) and single-component replicate swizzle can also be used with the predicate register. In the following example (and using the same predicate contents as before), all the components of r0 receive the multiplication results:
(!p0.z) mul r0, r1.x, c1
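The per-channel behavior of the predicate register can be checked on the CPU with a small model. The following C++ sketch is only an illustration of the semantics described above; the helper names (setp_gt, predicated_write, replicate) are hypothetical and are not part of any Direct3D API.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Pred4 = std::array<bool, 4>;

// Models "setp_gt p0, a, b": one greater-than test per channel.
Pred4 setp_gt(const Vec4& a, const Vec4& b) {
    Pred4 p{};
    for (int i = 0; i < 4; ++i) p[i] = a[i] > b[i];
    return p;
}

// Models "(p0) op dst, ...": only channels whose flag is true are written.
void predicated_write(Vec4& dst, const Vec4& result, const Pred4& p) {
    for (int i = 0; i < 4; ++i)
        if (p[i]) dst[i] = result[i];
}

// Models a single-component replicate swizzle with optional negation,
// for example (!p0.z): the chosen flag is copied to all four channels.
Pred4 replicate(const Pred4& p, int channel, bool negate) {
    assert(channel >= 0 && channel < 4);
    const bool flag = negate ? !p[channel] : p[channel];
    return {flag, flag, flag, flag};
}
```

With the constants used earlier, setp_gt yields a mask in which only the y flag is set, so a predicated write touches only the y channel, while replicate(p0, 2, true) enables all four channels.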
Using predication as a dynamic write mask has its uses; for very short branching sequences, it should be preferred over dynamic branching instructions like if_comp. Predication also uses fewer temporary registers than the equivalent non-predicated sequence of instructions, which can help compiler optimizations and may produce better code. Static and dynamic flow control instructions like loop, if_comp, etc., may not be used in predication mode, although the predicate register can be used as a branching condition using the dedicated flow control instructions if_pred, callnz_pred, and break_pred. A replicate swizzle must be used with those instructions in order to determine which component triggers the branch.
advantages to this feature: flexibility and performance. Flexibility because different branching instructions are now executed at the vertex or pixel level, allowing complex code trees to be implemented. Performance because code can now be run only for the vertices or pixels that require it (although the performance gained from unexecuted code may vary depending on hardware implementations). Dynamic branching instructions can be nested up to 24 levels deep; a description of the instructions follows:
if_comp: Conditionally performs the next sequence of instructions based on a comparison. The else/endif instructions are used to delimit the if blocks.
if_pred: Conditionally performs the next sequence of instructions based on the value of the predicate register. The else/endif instructions are used to delimit the if blocks.
callnz_pred: Conditionally calls a subroutine based on the value of the predicate register. The ret instruction is used to return from the subroutine.
break_pred: Conditionally breaks from a loop/endloop or rep/endrep block based on the value of the predicate register.
break_comp: Conditionally breaks from a loop/endloop or rep/endrep block based on a comparison.
A typical application of dynamic branching is the common (N.L) calculation (dot product of the normal and light vector). Depending on the result of the dot product, the rest of the lighting equation may or may not be calculated, improving performance in the process. The following pixel shader illustrates this:
ps_3_0
; User-defined constants
def c0, 0.0f, 0.0f, 0.0f, 1.0f
; Declare samplers
dcl_2d s1 ; Normal map
; Declare inputs
dcl_texcoord0 v0.xy ; Texture coordinates
dcl_texcoord1 v1.xyz ; Un-normalized light vector
texld r2, v0, s1 ; Retrieve pixel normal
nrm r1, v1 ; Normalize light vector
dp3 r0.w, r2, r1 ; Light calculation (N.L)
if_gt r0.w, c0.x ; if (N.L)>0
; Performs the rest of the lighting equation: specular,
; attenuation, light maps, etc. r3 contains the final pixel
; color
else
; Output black (or any other ambient color)
mov r3, c0
endif
mov oC0, r3 ; Output pixel color
The same principle can be applied to shadows (in or out of shadow), light attenuation (distance from the light exceeds maximum range), etc. Many optimizations can be performed using dynamic branching.
NOTE For small portions of conditional code, it is usually preferable to use the predicate register or other comparison instructions than to start a dynamic branch. There may be a setup cost associated with dynamic branching, and so running a few instructions for all conditions could be faster than running fewer instructions in separate branches.
The break instructions are used to break from loops (using loop/endloop or rep/endrep instructions), which can be useful for iterative mathematical operations. By breaking when the right result is found, the remaining loop iterations are not executed, thus improving performance.
Dynamic flow control allows numerous new effects to be implemented in vertex or pixel shaders. Recursion, tree structures, ray tracing, etc., are all possible with dynamic flow control.
Arbitrary Swizzle
Arbitrary source swizzling is now supported for both vs_3_0 and ps_3_0 (arbitrary source swizzling was not supported in ps_2_0). This feature allows the selection of source components to be specified in any order and eliminates the need to copy or modify registers when their component arrangement does not match the format required for the next instruction. Arbitrary source swizzling is compatible with texture instructions (in both vertex and pixel shaders); thus, texture coordinates can be selected in any order from a given set of coordinates. This is very useful when filter kernels are involved, as several sample points can be fetched simply by using source swizzles on the texture coordinates. The following example fetches five samples in an x-shaped kernel from a single set of 2D texture coordinates:
;--------------------------------------------------------------
; Constants specified by the app
; c0 = -1/TextureWidth, 1/TextureHeight,
;      1/TextureWidth, -1/TextureHeight
;--------------------------------------------------------------
ps_3_0
; Declare samplers
dcl_2d s0
; Declare inputs
dcl_texcoord0 v0.xy ; Texture coordinates UV
; Prepare all possible texture coordinate values
add r0, v0.xyxy, c0 ; r0 = (U-texel, V+texel,
                    ;       U+texel, V-texel)
; Fetches all 5 samples ('X' shape)
texld r1, v0, s0 ; Texel at (U, V)
texld r2, r0.xw, s0 ; Texel at (U-texel, V-texel)
texld r3, r0.zw, s0 ; Texel at (U+texel, V-texel)
texld r4, r0.xy, s0 ; Texel at (U-texel, V+texel)
texld r5, r0.zy, s0 ; Texel at (U+texel, V+texel)
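To see which coordinates each swizzle selects, the add and the four swizzled fetches can be mirrored on the CPU. This is a minimal sketch under the assumption that du and dv stand for 1/TextureWidth and 1/TextureHeight; the type Float2 and the function x_kernel_coords are illustrative names only.

```cpp
#include <array>

struct Float2 { float u, v; };

// Mirrors "add r0, v0.xyxy, c0" followed by the .xw, .zw, .xy, and .zy
// swizzles. du = 1/TextureWidth and dv = 1/TextureHeight come from the caller.
std::array<Float2, 5> x_kernel_coords(Float2 uv, float du, float dv) {
    // r0 = (U - du, V + dv, U + du, V - dv)
    const float x = uv.u - du, y = uv.v + dv, z = uv.u + du, w = uv.v - dv;
    return {{
        {uv.u, uv.v}, // v0:    center texel
        {x, w},       // r0.xw: (U - du, V - dv)
        {z, w},       // r0.zw: (U + du, V - dv)
        {x, y},       // r0.xy: (U - du, V + dv)
        {z, y},       // r0.zy: (U + du, V + dv)
    }};
}
```

A single add therefore prepares all four diagonal neighbors, and no further arithmetic is needed before the fetches.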
Interestingly, arbitrary source swizzling also works on the sampler registers. It is possible to swap or replicate color channels by using the appropriate swizzle with the sampler register. For instance, the following instruction changes the channel ordering from the default RGBA to ABGR when sampling a texel:
texld r0, v0, s0.abgr
Arbitrary source swizzles not only improve performance by avoiding copy or replicate instructions, but they also make shader code more readable by doing so.
; Declare inputs
dcl_texcoord0 v0.xy ; Texture coordinates
; Retrieve position data
texld r0.xy, v0, s0 ; Sample RG data into r0.xy
texld r0.zw, v0, s0.abrg ; Sample RG data into r0.zw
; r0.xyzw now contains position data
NOTE The predicate register can also be used to specify dynamic write masks on a texture sampling instruction.
vs_3_0 Features
Registers
A total of 32 temporary registers (r0...r31) are available in the vs_3_0 model, compared to a mere 12 for the vs_2_0 model. This number of registers provides more storage for complex mathematical functions as well as extra parameters for subroutines (see the Static and Dynamic Flow Control section above). To increase the flexibility of the shader, the 12 output registers have been renamed to oX (o0-o11) and can now contain any float values that will be iterated and supplied to the pixel shader. Of those, only ten are custom four-component output registers, as one register must be declared as the output position and the remaining one may only be used for point sprite size. For more information on vertex shader declarations, see the Flexible Input and Output Declarations section in this article.

The loop counter register aL, used in vs_2_0 to index constants within a loop, can now also be used to relatively address both input and output registers. This enables the same piece of code to operate on a set of different inputs. This can be useful, for instance, to apply the same transformations to a set of vertex positions or output the results of per-vertex light vector calculations to texture coordinates. The following code gives an example of output register indexing in a vs_3_0 program.
;--------------------------------------------------------------
; Constants specified by the app
; c0-c3 = Global transformation matrix (World*View*Projection)
; c12-c19 = Model space positions of light sources
;--------------------------------------------------------------
vs_3_0
; Declare constant integer for looping
defi i0, 8, 2, 1, 0 ; Loop 8 times, starting from 2 and
                    ; incrementing by 1 each iteration
def c4, 0, 0, 0, 1 ; Static constant
; Declare input registers
dcl_position0 v0 ; Input position
; Declare output registers
dcl_position0 o0.xyzw ; Output position
dcl_texcoord0 o2.xyzw
dcl_texcoord1 o3.xyzw
dcl_texcoord2 o4.xyzw
dcl_texcoord3 o5.xyzw
dcl_texcoord4 o6.xyzw
dcl_texcoord5 o7.xyzw
dcl_texcoord6 o8.xyzw
dcl_texcoord7 o9.xyzw
; Set r0.w to 1 (used in distance calculation later on)
mov r0.w, c4.w
; Lighting pre-processing
loop aL, i0
; Compute vertex-to-light vectors and distance
sub r0.xyz, c[aL+10], v0 ; Subtract vertex position from
                         ; model space light position
nrm r1, r0 ; Normalize vector
rcp r1.w, r1.w ; 1/(1/distance) = dist(light, vertex)
mov o[aL], r1 ; Store result in corresponding texture
              ; coordinate output
endloop
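It can help to list which registers the loop above touches on each iteration. The following C++ sketch is a hypothetical model of the aL counter (the function loop_register_pairs is not a Direct3D call); it assumes, as the listing's comments state, that defi i0 holds the iteration count, the initial value of aL, and the step.

```cpp
#include <utility>
#include <vector>

// For "defi i0, count, start, step, 0" and "loop aL, i0", lists the
// (constant register, output register) pair touched in each iteration
// when the loop body reads c[aL + offset] and writes o[aL].
std::vector<std::pair<int, int>> loop_register_pairs(int count, int start,
                                                     int step, int offset) {
    std::vector<std::pair<int, int>> pairs;
    int aL = start;
    for (int i = 0; i < count; ++i, aL += step)
        pairs.emplace_back(aL + offset, aL);
    return pairs;
}
```

Calling loop_register_pairs(8, 2, 1, 10) gives the pairs (c12, o2) through (c19, o9), which matches the light positions and texture coordinate outputs declared in the shader.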
Other registers (16 input registers, 256 constant float registers, 16 constant integer registers, 16 constant Boolean registers, address register) remain unchanged compared to vs_2_0.
Instructions
The new vs_3_0 model supports a minimum of 512 instructions in a vertex shader program compared to 256 for the vs_2_0 model. Note that the number of executed instructions can potentially be made larger by the use of loops and subroutines within the vertex shader. Supporting longer shaders not only enables more operations to be performed (advanced animation, complex vertex lighting, etc.) but also allows different shaders to be concatenated into a larger one, which reduces or even eliminates vertex shader state changes and improves performance.

The _abs source modifier is a new addition to vs_3_0. It forces the absolute value of a source register to be used in an instruction. Note that it takes precedence over the negate modifier (-) so that a negative value can always be guaranteed. Here are a few examples:
add r0, r8_abs.x, c10 ; Adds the absolute value of r8.x
                      ; and c10 together
mad r0, r1, r2, -v_abs[2] ; Multiplies r1 and r2 and subtracts
                          ; the absolute value of v2.
                          ; Note that v2_abs also works.
In an effort to unify the vertex and pixel shader models, the _sat instruction modifier that was available in ps_2_0 has been included in vs_3_0. Applying this modifier clamps the result to the [0,1] range:
sub_sat r0, r0, r1 ; Subtracts r1 from r0 and clamps
                   ; the result to the [0,1] range
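The arithmetic behind the two modifiers is simple enough to state per channel. The C++ functions below are a sketch with illustrative names (sat, neg_abs, sub_sat); they model the behavior described above for a single float and are not part of any shader API.

```cpp
#include <algorithm>
#include <cmath>

// _sat instruction modifier: clamp the result to the [0, 1] range.
float sat(float x) { return std::min(std::max(x, 0.0f), 1.0f); }

// Negate applied on top of _abs: the absolute value is taken first,
// so the operand can never be positive.
float neg_abs(float x) { return -std::fabs(x); }

// "sub_sat r0, r0, r1" for one channel.
float sub_sat(float a, float b) { return sat(a - b); }
```

For example, sub_sat(0.25f, 1.0f) clamps to 0, and neg_abs returns -2 for both 2 and -2.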
New instructions in the vs_3_0 model that relate to dynamic branching are discussed in the Static and Dynamic Flow Control section.
Texture Sampling
The 2.0 model introduced basic texture sampler functionality to the vertex shader unit. This access was limited to a single texture with a fixed set of texture coordinates either read directly from the vertex stream (which supports filtering) or derived from the vertex index (which supports point sampling only) and only in combination with n-patches.

The 3.0 model introduces true vertex texturing support, which is texture access from the vertex shader at the same level of functionality and flexibility existing in the pixel shader unit. Using this new functionality is also very similar to using textures in the pixel shader; textures (SetTexture) and sampler states (SetSamplerState) simply have to be set for the four available vertex texture sampler stages (D3DVERTEXTEXTURESAMPLER0, D3DVERTEXTEXTURESAMPLER1, D3DVERTEXTEXTURESAMPLER2, and D3DVERTEXTEXTURESAMPLER3) with the same arguments used for regular textures. These samplers also need to be declared as part of the shader program using the dcl_textureType s# syntax, where the texture type can be 2d, cube, or volume.

The only difference with textures in the pixel shader is that anisotropic filtering is not supported for vertex textures. Also, because the rate of change information is not available, the shader or application has to compute the level of detail (LOD) and provide that information as a parameter to the actual texture sampling instruction. Hence, only the texldl instruction is supported, for which the particular mipmap level (LOD) to sample has to be specified as the fourth component of the texture coordinate. Given that texture sampling is now implemented using an instruction (unlike the 2.0 model, where the sampled data appears in an input register), it is now possible to modify the texture coordinates and LOD before sampling, meaning that procedural texture coordinates are possible as well as dependent texture reads (using the result of one texture read to read into another texture). The number of reads and dependent reads is unlimited in the 3.0 model.
Vertex texturing allows the implementation of huge lookup tables, effectively using the texture as a massive data storage area that can be accessed freely from within the vertex shader. Up to four variables can be fetched from the table per read (RGBA components). Completely flexible displacement mapping (reading a value from a texture and using it to displace a vertex, e.g., along its normal vector) is also possible. This functionality is no longer limited to point sampling (pre-sampled displacement mapping in vs_2_0) or geometry with n-patches enabled. The following is a vertex shader example performing displacement mapping:
;--------------------------------------------------------------
; Constants specified by the app
; c0-c3 = Global transformation matrix
; c11.x = Scaling factor for displacement
;--------------------------------------------------------------
vs_3_0
; Samplers
dcl_2d s0 ; Declare sampler
; Input registers
dcl_position v0
dcl_normal v3
dcl_texcoord0 v4
; Output registers
dcl_position0 o0.xyzw
dcl_texcoord0 o1.xy
; Sample texture
texldl r0, v4, s0
; Displacement mapping
mul r2, v3, c11.x ; Create displacement vector
                  ; (based on normal vector)
mul r2.xyz, r2, r0.x ; Multiply unit displacement
                     ; vector by displacement scalar
add r0.xyz, v0, r2 ; Displace vertex position
; Vertex transformation
m4x4 o0, r0, c0
mov o1.xy, v4
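The displacement step in the shader above reduces to one multiply-add per component. As a minimal CPU-side sketch (Float3 and displace are illustrative names, and height stands for the value read from the red channel of the displacement texture):

```cpp
struct Float3 { float x, y, z; };

// Displaces a position along its normal: position + normal * scale * height,
// where height stands for the value sampled from the displacement texture
// and scale for the application-supplied factor (c11.x in the listing).
Float3 displace(Float3 position, Float3 normal, float scale, float height) {
    const float d = scale * height;
    return {position.x + normal.x * d,
            position.y + normal.y * d,
            position.z + normal.z * d};
}
```

With a normal of (0, 1, 0), a scale of 2, and a sampled height of 0.25, the vertex (1, 2, 3) moves to (1, 2.5, 3).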
A form of geometry loopback can also be implemented where a complex vertex shader (e.g., very complex skinning and lighting models) is executed once, and the resulting vertex information is stored out into several textures using a trivial pixel shader program. It is then possible to read this vertex information back and send it to the pixel shader multiple times to implement some complex multipass effect. This same principle can also be used to implement geometry images, as described by Xianfeng Gu, Steven J. Gortler, and Hugues Hoppe [Gu], where impressive geometry compression is achieved by using textures as data storage for a model's vertex positions and normals. Similarly, it is also possible to generate procedurally animated geometry, where an object's vertex positions and normals are stored within a texture that is then processed recursively by a complex pixel shader program to create a procedurally animated object. This principle is explained in detail in the article "Cloth Animation with Pixel and Vertex Shader 3.0" in ShaderX2: Shader Programming Tips & Tricks with DirectX 9.
StreamIndex indicates which stream is to have its frequency set, while Frequency is the frequency to which it will be set.
One practical usage of vertex stream frequencies is vertex compression. A 3D model can be separated into chunks of vertices; each chunk is composed of full-precision 3D coordinates, indicating the chunk position, and a number of lower-precision offset vertices. The vertex shader adds the base position to each of the offset values to generate the untransformed vertex. The first stream is given a frequency indicating how many offset vertices are to use the same base position data, while the frequency of the second stream remains unchanged. Figure 1 illustrates this principle for a given set of 16 vertices.
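The mapping from vertex index to stream element can be sketched on the CPU. The following C++ model assumes that a stream with frequency f advances to its next element once every f vertices, which is how the compression scheme above is described; the names stream_element and decompress are hypothetical, and the frequencies used in any test are illustrative values rather than those of Figure 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };

// Element of a stream read for a given vertex when the stream advances
// once every `frequency` vertices.
std::size_t stream_element(std::size_t vertexIndex, std::size_t frequency) {
    assert(frequency > 0);
    return vertexIndex / frequency;
}

// Rebuilds untransformed positions: base position (shared by `frequency`
// consecutive vertices) plus the per-vertex offset.
std::vector<Float3> decompress(const std::vector<Float3>& bases,
                               const std::vector<Float3>& offsets,
                               std::size_t frequency) {
    std::vector<Float3> out;
    out.reserve(offsets.size());
    for (std::size_t i = 0; i < offsets.size(); ++i) {
        const Float3& b = bases.at(stream_element(i, frequency));
        out.push_back({b.x + offsets[i].x, b.y + offsets[i].y, b.z + offsets[i].z});
    }
    return out;
}
```

In the shader itself, the addition is the only work left to do, since the hardware performs the index division when it fetches from the base-position stream.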
Another typical usage of stream frequencies is to use a vertex stream to control the animation of individual (or groups of) triangles in a vertex buffer. For example, explosions can be controlled at the triangle level by setting the desired animation data for vertices in a control stream set to a frequency of 3 (one for each triangle in the model vertex buffer). The vertex shader then transforms each triangle vertex in the model using the animation data in the control stream. The frequency can be set to higher values so that blocks of triangles can be transformed together. Any type of vertex data can be shared between groups of vertices. For instance, a vertex stream containing triangle normals could be set up with a frequency of 3 to avoid duplicating the normal vector across all three vertices defining a face in the associated triangle list. Hierarchical sub-mesh information could also be stored using this feature by using several streams of various frequencies. Future versions of DirectX might implement vertex stream stepping as well as frequency, enabling geometry instancing to be performed by looping streams multiple times.
ps_3_0 Features
Registers
The ps_3_0 model supports 32 temporary registers and 256 constant registers (224 float, 16 integer, and 16 Boolean). This increase enables more data to be manipulated or stored compared to the ps_2_0 model, which only supports 12 temporary and 32 constant registers. While ps_2_0 supported eight float and two integer input registers, all ten input registers of ps_3_0 are now in float format. Thus, interpolated colors from the vertex shader can be passed as float, increasing their precision in the process. Predication and static/dynamic flow control rely on two additional registers, p0 and aL. Note that input register indexing can also be performed using the loop counter register aL.

A face register is now available in ps_3_0, which is used to indicate whether the incoming pixel is part of a front- or back-facing triangle (front-facing triangles are defined by a clockwise vertex ordering). Typical usages are two-sided lighting and volume algorithms. The sign of the vFace register determines whether the pixel is front- or back-facing, and the if_comp and setp instructions are used to test for the sign of the face register. The following example sets front-facing pixels to red and back-facing pixels to green using predication (note that the vFace register must be declared prior to being used in a pixel shader program):
ps_3_0
; Declare face register
dcl vFace
; Declare constant
def c0, 0, 0, 0, 1
; Set predicate to true if front-facing, false otherwise
setp_gt p0.x, vFace, c0.x
; Set front faces to red and back faces to green
(p0.x) mov oC0, c0.wxx
(!p0.x) mov oC0, c0.xwx
Another useful register present in the ps_3_0 model is the position register vPos. Once declared, this register contains the current pixel position in screen coordinates. As such, only the x and y components of vPos are valid. This facility is interesting for all sorts of post-process effects operating on a surface containing a rendered scene. For example, deferred shading algorithms can use the vPos register to retrieve the current pixel position of a volume and thus directly use it as texture coordinates to sample data in screen-aligned textures. As a simple example, the following code renders every second horizontal line with a different color:
ps_3_0
; Declare position register
dcl vPos.xy
; Declare constant
def c0, 1, 0, 0, 0.5
; Divide position by 2
mul r0.xy, vPos, c0.w
; Retrieve fractional part
frc r0.xy, r0
; Set predicate to true if fraction != 0
setp_ne p0.xy, r0, c0.y
; Output different colors based on predicate register
(p0.y) mov oC0, c0.xyyx ; Output red
(!p0.y) mov oC0, c0.yxyx ; Output green
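The parity test performed by the mul, frc, and setp_ne sequence can be verified with a short CPU model. The sketch below assumes that the y component of vPos holds a whole-number pixel row; frc and row_predicate are illustrative helper names.

```cpp
#include <cmath>

// frc: fractional part, x - floor(x).
float frc(float x) { return x - std::floor(x); }

// Mirrors mul/frc/setp_ne on the y channel: true for odd rows
// (fraction 0.5), false for even rows (fraction 0).
bool row_predicate(int y) {
    return frc(static_cast<float>(y) * 0.5f) != 0.0f;
}
```

Rows for which row_predicate is true take the first predicated mov (red), and the remaining rows take the negated one (green).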
Instructions
As with the vs_3_0 model, ps_3_0 supports a minimum instruction count of 512. This is a considerable increase compared to the ps_2_0 model, which only supports a minimum of 96 instructions (64 arithmetic and 32 texture instructions). Indeed, complex shaders like shadow mapping with percentage-closer filtering or large filter kernels could already exceed the ps_2_0 limit. Also, these 512 instructions can be arithmetic or texture instructions, as there is no restriction on their type. Note that the number of executed instructions can potentially be made larger by the use of loops and subroutines within the pixel shader.

One obvious advantage to supporting that many instructions is the reduction in pixel shader state changes. By using static flow control, several pixel shaders can be combined into a longer one, and the corresponding code path can be chosen based on a dynamic constant. The increase in performance by reduction of shader state changes is even more significant when the scene uses a large number of different shaders.

The _abs source modifier present in vs_3_0 is also available in ps_3_0. It forces the absolute value of a source register to be used in an instruction. For a code example using this modifier, see the Instructions paragraph of the vs_3_0 Features section.

The ps_3_0 model contains new texture instructions. The selection of a particular mipmap level can be forced by using the texldl instruction and setting the desired MIP level into the w component of the source texture coordinates. A blend between MIP levels can be achieved by setting a fractional value for w. This feature can be useful for micro or detail texturing or to customize texture filtering.

Gradient instructions are a new feature of the ps_3_0 model. These new instructions are dsx, dsy, and texldd. Gradient instructions are used to detect the rate of change of a given register across adjacent pixels in the horizontal (dsx) and vertical (dsy) directions. The texldd instruction can then be used to sample a pixel according to the horizontal and vertical rates of change of the texture coordinates passed to the function. Gradient instructions are generally used to determine the mipmap levels applied to a sampled texel so that custom filtering can be applied. As an example, the following shader determines the rates of change in texture coordinates and feeds them to the texldd instruction:
ps_3_0
; Samplers
dcl_2d s0
dcl_texcoord0 v0.xy ; Texture coordinate
; Compute the horizontal and vertical rates of change in
; adjacent texture coordinates
dsx r1, v0 ; Horizontal
dsy r2, v0 ; Vertical
; Sample pixel
texldd r0, v0, s0, r1, r2
Centroid is an instruction modifier used to adjust the texture sampling location when multisampling is used. This is used to avoid artifacts when a multisampled triangle edge does not cover the center of a pixel but does cover the center of at least one sub-pixel of the multisampled mask. Centroid is used by appending the _centroid modifier to a texture instruction. The following is an example of the pixel shader code that can be used on a scene with multisampling enabled:
ps_3_0
; Samplers
dcl_2d s0 ; Texture
; Texture coordinate
Conclusion
The 3.0 shader model is a huge step forward compared to the previous 2.0 model. New features have been introduced while register and instruction limits have been increased dramatically, allowing for much more advanced effects to be implemented. Simplicity has also been greatly enhanced by unifying the vertex and pixel shader models and allowing more flexibility on instructions and registers.
References
[Gu] Gu, X., S. Gortler, and H. Hoppe, "Geometry Images," ACM SIGGRAPH '02, pp. 355-361, http://research.microsoft.com/~hoppe/.
Introduction
As promised, DirectX 9 has a lot of new functionality, mainly in the programmable pipeline. Floating-point support in pixel shaders gives us what we missed in Direct3D 8: precision in this major part of rendering. Larger shaders and flow control allow more effects. New types of textures (16-bit per component and floating-point components) give us an extra bit of detail. Of course, new hardware is on the market (or heading to the market): ATI Radeon 9700, nVidia GeForceFX, and cards from S3, 3DLabs, and other companies. This article discusses the new possibilities of Direct3D 9. We begin with classic per-pixel shading. First we improve it for version 2.0 shaders: great quality and no more lookup textures. Then we utilize 3.0 shaders to show how to do four spotlights in one pass with dynamic flow control and relative addressing. We continue with per-pixel environment bump mapping: the DirectX 8.1 version is presented first (with pixel shader 1.4), and then the new shaders 2.0 version is presented. The Fresnel term is added for a more impressive and realistic effect.
The end of the article is reserved for two lighting models that are not commonly used in real-time computer graphics. This is mainly due to limitations of the hardware prior to the new versions of DirectX. The Oren-Nayar generalization of the Lambertian diffuse model is implemented with 2.0 shaders. It brings more realism to materials like clay and porcelain. The specular part of the Cook-Torrance model is presented with both pixel shader 1.4 and 2.0 for visual comparison. This model produces very good results for metallic surfaces. The following sections are organized similarly: The whole shader is presented at the beginning of the section, and then it is broken into pieces with necessary explanations. New shader concepts (syntax) are explained in depth.
Per-Pixel Phong
This section covers the possibilities of Phong lighting with new shader models. It is targeted mainly at people upgrading to DirectX 9 from a previous version. Because of this, knowledge of the Phong lighting equation (only a brief review is available here), concepts of per-pixel shading, and DirectX 8 is expected. Most of this can be found in [1] (this article is a direct extension). Other sources of information are [2] and [3]. A shader reference is available on the MSDN DirectX web pages (http://msdn.microsoft.com/directx).
...where n is the surface normal vector, l is the vector from the surface point to the light, and v is the vector from the surface point to the viewer's position. Every vector is assumed to be normalized. m_diffuse is the color of the diffuse material at a given pixel, while m_specular is the color of the specular material at a given pixel. I_spotlight is used to simulate a spotlight. In our case, we use an additional texture, which is projected in the spotlight's direction onto every object. Think of it as a projector.
//Computes light and eye vectors and projector's texture coordinates
//------------------------------
add r0, c8, -r8 //Build the light vector
nrm r1, r0 //normalize vector
m3x3 oT1.xyz, r1, r9 //to tangent space
add r0, c9, -r8 //build the eye vector
nrm r1, r0 //normalize vector
m3x3 oT2.xyz, r1, r9 //to tangent space
Here is the first change that we can find in the declaration of input registers:
dcl_position v0
dcl_normal v1
dcl_texcoord v2
dcl_tangent v3
In Direct3D 8, vertices were declared only at the time of shader creation outside the shader. We specified which input register in the shader would be loaded with which part of data. In Direct3D 9, we have two declarations:
Outside the shader, with the SetVertexDeclaration method. In this phase, we define for each input element the stream from which it will be loaded, the offset in bytes from the start of the stream to the data element, the type of data (e.g., float, float3 for a vector, etc.), and the semantic of the element (e.g., position, normal, tangent, binormal, etc.; this will be used later in the shader).
Inside the shader. Here we specify the target register for data with the specified semantic (e.g., position, normal, tangent, binormal, etc.; the same as outside the shader).
This allows us to write shaders without expectations of a specific input structure and to specify a new vertex format for each model while still using the same shader. In the example above, we load one position, normal, texture coordinate, and tangent into the first four input registers, but in the vertex stream, this data can be anywhere and even sorted in a different way (i.e., normal, texcoord, position, and tangent).

Later, in the tangent space base vectors computation, we use the new crs macro instruction to do the cross product instead of the mul r0,r9.zxyw,r11.yzxw; mad r10,r9.yzxw,r11.zxyw,-r0 pair known from previous versions. This macro takes two instruction slots and is most likely expanded to mul-mad internally. Also note that we have to explicitly fill the w component of every base vector because this version of the shader does not allow us to use a component that was not filled previously, and crs uses all four components of input registers. The following is the creation of a tangent space base:
m3x3 r11.xyz, v1, c0     //N to world space
mov  r11.w, v1.w
m3x3 r9.xyz, v3, c0      //T to world space
mov  r9.w, v3.w
crs  r10.xyz, r9, r11    //Cross product - binormal B=NxT
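To see why the older mul/mad pair and crs give the same result, the swizzles can be written out on the CPU. This is a minimal Python sketch under the assumption that only the x, y, and z components matter; the helper names are made up for the illustration.

```python
def swizzle(v, pattern):
    """Pick components of v in the order given by pattern, e.g. 'zxy'."""
    index = {"x": 0, "y": 1, "z": 2}
    return [v[index[c]] for c in pattern]


def cross_mul_mad(a, b):
    """Cross product written as the mul/mad instruction pair."""
    # mul r0, a.zxy, b.yzx
    r0 = [p * q for p, q in zip(swizzle(a, "zxy"), swizzle(b, "yzx"))]
    # mad r10, a.yzx, b.zxy, -r0
    return [p * q - t
            for p, q, t in zip(swizzle(a, "yzx"), swizzle(b, "zxy"), r0)]


def cross(a, b):
    """Textbook cross product a x b, used as the reference."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


tangent = [1.0, 0.0, 0.0]
normal = [0.0, 0.0, 1.0]
binormal = cross_mul_mad(tangent, normal)
```

Note that the operand order matters: crs r10, r9, r11 corresponds to cross(r9, r11) in this sketch.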
When light and eye vectors are computed, the nrm macro instruction is used to normalize the vector instead of the previously used three instructions. Here is the transformation of a normalized vector to tangent space:
add  r0, c8, -r8        //Build the light vector
nrm  r1, r0             //normalize vector
m3x3 oT1.xyz, r1, r9    //to tangent space
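The last shader instruction is used to compute the spotlight texture coordinates for this vertex. The matrix has the following form:

\[
M_{ObjectToWorld} \cdot M_{SpotView} \cdot M_{SpotProjection} \cdot
\begin{pmatrix}
0.5 & 0 & 0 & 0 \\
0 & -0.5 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0.5 & 0.5 & 0 & 1
\end{pmatrix}
\]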
MObjectToWorld is a matrix that transforms vertices from object to world space. MSpotView is the spotlight's view transformation matrix, while MSpotProjection is the spotlight's perspective transformation matrix. Because of the similarity with a camera, they can be easily computed with the D3DXMatrixLookAtLH and D3DXMatrixPerspectiveFovLH functions. The last matrix in the previous equation is used to shift coordinates from range [-1...1] (output of the clipping matrix) to range [0...1]. Note that this matrix negates the y coordinate because in texture space, y has a value of 0.0 at the top and 1.0 at the bottom, while clipping space has 1.0 at the top of the space and -1.0 at the bottom.

Usage of macro instructions all over the shader is preferred over the usage of their inline versions. This is because they are not expanded by the Direct3D runtime but by the driver. If the hardware supports a specific macro, it is executed directly; if not, it is safely replaced with supported instructions.
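The effect of the last matrix can be checked numerically. The sketch below is an assumption-laden illustration: the `BIAS` entries are chosen to match the description above (scale by 0.5, negate y, shift by 0.5, row-vector convention), and the divide by w stands in for what the perspective-correct texture lookup does later in the pixel shader.

```python
# Assumed bias matrix, row-vector convention (v' = v * M), matching the
# description: [-1...1] is mapped to [0...1] and y is negated.
BIAS = [
    [0.5,  0.0, 0.0, 0.0],
    [0.0, -0.5, 0.0, 0.0],
    [0.0,  0.0, 0.0, 0.0],
    [0.5,  0.5, 0.0, 1.0],
]


def vec_mul_mat(v, m):
    """Multiply a 4-component row vector by a 4x4 matrix."""
    return [sum(v[k] * m[k][c] for k in range(4)) for c in range(4)]


def projector_uv(clip_pos):
    """Texture coordinates seen by the projector for a clip-space position."""
    x, y, _, w = vec_mul_mat(clip_pos, BIAS)
    return (x / w, y / w)  # perspective divide, as done by the projective lookup


top_left = projector_uv([-1.0, 1.0, 0.5, 1.0])
bottom_right = projector_uv([2.0, -2.0, 1.0, 2.0])  # same corner scaled by w = 2
```

The top-left corner of the spotlight's clip space lands on texel coordinate (0, 0) and the bottom-right corner on (1, 1), independent of w.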
//-----------------------------
dcl_2d s0                     //diffuse texture (gloss in alpha)
dcl_2d s1                     //normal texture
dcl_2d s3                     //spotlight texture

// Output
//-----------------------------
// oC0 - output color
//
// Set up needed vectors - load and normalize
//-----------------------------
texld r0, t0, s1              //load normal vector
mad   r1, r0, c31.r, -c31.g   //bias normal to range -1,1
nrm   r11, r1                 //r11 = normalized normal
mov   r1.xyz, t1
nrm   r10, r1                 //r10 = normalized light vector
mov   r1.xyz, t2
nrm   r9, r1                  //r9 = normalized eye vector

// Compute diffuse and specular intensities
//-----------------------------
dp3     r0.r, r11, r10        //r0 = (n.l)
mul     r1, r0.r, c31.r       //r1 = 2*(n.l)
mad     r1, r1, r11, -r10     //r1=2(n.l)n-l reflectance vector
dp3_sat r1, r1, r9            //r1 = (r.v)
pow     r0.g, r1.r, c2.r      //r0.g = (r.v)^shi specular term
cmp     r0, r0.r, r0, c31.b   //if (n.l)<0 do not light anything

// Modulate texture with computed intensities
//-----------------------------
texld  r1, t0, s0             //load diffuse texture (gloss is in alpha)
texldp r4, t3, s3             //load projector map (perspective correct)
mul    r2, r1.a, r0.g         //multiply specular intensity with gloss
mul    r2, r2, c1             // and with material's specular color
mul    r3, r1, r0.r           //multiply diffuse intensity with texture
mul    r3, r3, c0             // and with material's diffuse color
add    r0, r2, r3             //combine it together
mul    r0, r0, r4             //modulate it with spotlight texture
mov    oC0, r0                //color output
Changes start from the very beginning. We have to specify the input registers that come from the vertex shader. In the previous version, these registers had to be loaded with special texcrd instructions into a temporary register before they could be used in mathematical instructions. Now we just declare them as used with the dcl instruction, optionally with individual .xyzw components. (If we fail to specify components, the compiler will assume .xyzw. If these are not filled in the vertex shader, the run-time shader linker will fail.) Then we can use them as read-only registers freely across the shader. The following is a declaration of input registers.
dcl t0.xy      //texture coordinates
dcl t1.xyz     //light vector
dcl t2.xyz     //eye vector
dcl t3.xyzw    //projector texture coordinates
Similarly, we have to declare used texture samplers (simply said, texture stages that we want to use) with the type of used texture (_2d, _3d, or _cube). Again, failing to specify correct input types will generate linker errors at run time. The following is a declaration of input texture samplers.
dcl_2d s0    //diffuse texture (gloss in alpha)
dcl_2d s1    //normal texture
dcl_2d s3    //spotlight texture
After setup, our first step is to load a normal vector. We are using the texld instruction, but it differs from the pixel shader 1.4 version. Instead of specifying the output register/source texture stage and texture coordinate (texld r0, t0 stands for load texel at coordinates t0 from texture stage 0 to register r0), we use three registers: the output temporary register, the texture coordinates, and the input sampler register (this follows the syntax of all instructions in the shaders; the first parameter is an output register, and then there are the inputs). Note that texture coordinates can be specified by an input register from the vertex shader tn or by a temporary register rn (dependent read), and texture sampling can occur anywhere in the shader.
As in the previous version, we need to expand the components of a normal vector from range [0...1] to range [-1...1]. Because there is no longer a _bx2 instruction modifier, we need to do this manually by multiplication with 2.0 and subtraction of 1.0 (mad r1, r0, 2.0, -1.0 in the shader after replacing constants with values).

Now comes the big difference. Thanks to the power of pixel shader 2.0, we can now normalize each input vector. Mark Kilgard wrote in [4] that doing trilinear filtering of a normal texture denormalizes the resulting vector (which is correct), but the result is acceptable because it simulates dimming of bumps with increased distance from the viewer. When Kilgard saw images done with shaders that were using normalization per pixel, I think he changed his mind. Normalization is done with the nrm macro instruction (the same syntax and behavior as in the vertex shader).

Note that we had to use mov r1.xyz, t1; nrm r0, r1 instead of the simple nrm r0, t1 because when D3DXAssembleShader is used to load and compile the shader at run time, the application crashes with an error inside the call (this should be corrected in the next version of the SDK). The command-line compiler psa.exe handles this correctly. The following is a preparation of input vectors.
texld r0, t0, s1              //load normal vector
mad   r1, r0, c31.r, -c31.g   //bias normal to range -1,1
nrm   r11, r1                 //r11 = normalized normal
mov   r1.xyz, t1
nrm   r10, r1                 //r10 = normalized light vector
mov   r1.xyz, t2
nrm   r9, r1                  //r9 = normalized eye vector
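The denormalization caused by texture filtering, and its repair by nrm, can be reproduced with two texels. This Python sketch is only an illustration: the two encoded normals and the simple 50/50 average standing in for the filter are assumptions made for the example.

```python
import math


def bias(texel):
    """Expand a texel from [0...1] to [-1...1], like mad r1, r0, 2.0, -1.0."""
    return [2.0 * c - 1.0 for c in texel]


def vec_length(v):
    return math.sqrt(sum(c * c for c in v))


def normalize(v):
    """Scale v to unit length, like the nrm macro."""
    length = vec_length(v)
    return [c / length for c in v]


# Two unit normals, (0,0,1) and (1,0,0), stored as texels in [0...1].
texel_a = [0.5, 0.5, 1.0]
texel_b = [1.0, 0.5, 0.5]

# A filter that blends both texels equally (assumed stand-in for filtering).
filtered_texel = [(a + b) / 2.0 for a, b in zip(texel_a, texel_b)]

filtered_normal = bias(filtered_texel)        # shorter than unit length
fixed_normal = normalize(filtered_normal)     # unit length again
```

The filtered normal has length of about 0.707, which would dim the lighting; after normalization its length is 1 again.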
The next part of the shader is easy to understand. First we compute the reflection vector (because every input vector is normalized, the reflection vector is normalized too). Then we can do a power calculation in the shader with the pow macro instruction; no more lookup textures are needed. If we use mul r0.a, r0.a, 100.0; pow r0.g, r1.r, r0.a instead of the pow r0.g, r1.r, c2.r found in the shader, we can have a per-pixel shininess parameter at virtually no additional cost (if we stored it in the alpha component of the normal texture). The last instruction in this block (cmp) is used to disable back lighting of pixels that are not visible from the light source (the angle between n and l is greater than π/2, and therefore the dot product is less than zero). In this case, we reset the entire register r0 to zero. To make it clear, r0.r holds the diffuse intensity, and r0.g holds the specular intensity. The following is a computation of diffuse and specular intensities.
dp3     r0.r, r11, r10        //r0 = (n.l)
mul     r1, r0.r, c31.r       //r1 = 2*(n.l)
mad     r1, r1, r11, -r10     //r1=2(n.l)n-l reflectance vector
dp3_sat r1, r1, r9            //r1 = (r.v)
pow     r0.g, r1.r, c2.r      //r0.g = (r.v)^shi specular term
cmp     r0, r0.r, r0, c31.b   //if (n.l)<0 do not light anything
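The same six steps can be followed on the CPU. The following Python sketch mirrors the instruction sequence one step per line; it assumes unit-length inputs, and the function name and tuple return value are choices made for this illustration only.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def phong_intensities(n, l, v, shininess):
    """Return (diffuse, specular) for unit vectors n, l, and v."""
    n_dot_l = dot(n, l)                                     # dp3
    r = [2.0 * n_dot_l * nc - lc for nc, lc in zip(n, l)]   # mul + mad
    r_dot_v = min(max(dot(r, v), 0.0), 1.0)                 # dp3_sat
    specular = r_dot_v ** shininess                         # pow
    if n_dot_l < 0.0:                                       # cmp: zero if (n.l)<0
        return (0.0, 0.0)
    return (n_dot_l, specular)


s = math.sqrt(0.5)
head_on = phong_intensities([0, 0, 1], [0, 0, 1], [0, 0, 1], 16.0)
mirror = phong_intensities([0, 0, 1], [s, 0, s], [-s, 0, s], 16.0)
behind = phong_intensities([0, 0, 1], [0, 0, -1], [0, 0, 1], 16.0)
```

In the `mirror` case, the eye sits exactly in the reflected direction, so the specular term is at its maximum while the diffuse term is cos(45°); in the `behind` case, both terms are cleared.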
In the last section of the shader, we load the decal texture with the texld instruction, and we load spotlight texture with perspective correct division with the texldp instruction (this instruction is the same as doing texld r3, t3_dw.xyw in pixel shader 1.4). In the next few instructions, we modulate intensities with light constants and textures. Note that in pixel shader 2.0 the output register is oCn instead of r0 from the previous version, and the pixel shader can have up to four outputs. The following is the final color output.
texld  r1, t0, s0        //load diffuse texture (gloss is in alpha)
texldp r4, t3, s3        //load projector map (perspective correct)
mul    r2, r1.a, r0.g    //multiply specular intensity with gloss
mul    r2, r2, c1        // and with material's specular color
mul    r3, r1, r0.r      //multiply diffuse intensity with texture
mul    r3, r3, c0        // and with material's diffuse color
add    r0, r2, r3        //combine it together
mul    r0, r0, r4        //modulate it with spotlight texture
mov    oC0, r0           //color output
HLSL Version
Here we provide the High Level Shading Language version of the above shaders to show the simplicity of such programming. The vertex shader comes first and then the pixel shader. The following is the HLSL vertex shader for Phong lighting.
// Used input structure
//-----------------------------
struct VS_INPUT
{
    float4 vPosition : POSITION;
    float3 vNormal   : NORMAL;
    float2 tcCoord   : TEXCOORD;
    float3 vTangent  : TANGENT;
};

// Used output structure
//-----------------------------
struct VS_OUTPUT
{
    float4 vClipPos : POSITION;   //Clipping space position
    float2 tcCoord  : TEXCOORD0;  //texture coordinates
    float3 vLight   : TEXCOORD1;  //light vector
    float3 vEye     : TEXCOORD2;  //eye vector
    float4 tcpSpot  : TEXCOORD3;  //perspective spotlight coordinates
};

// Constant registers
//-----------------------------
float4x4 mToWorld : register(c0);   //world space transposed
float4x4 mToClip  : register(c4);   //world * view * proj
float4   pLight   : register(c8);   //Light position (world space)
float4   pEye     : register(c9);   //Eye position (world space)
float4x4 mSpot    : register(c10);  //Spotlight projection matrix

// function    : main
// description : vertex shader function
// return      : VS_OUTPUT
// param:
//   VS_INPUT input : vertex shader input
//-----------------------------
VS_OUTPUT main(const VS_INPUT input)
{
    VS_OUTPUT output;

    //The following code outputs the position and texture coordinates
    //-----------------------------
    output.vClipPos = mul(input.vPosition, mToClip);
    output.tcCoord = input.tcCoord;
    float4 pVertexWorld = mul(input.vPosition, mToWorld);

    //The following code generates the tangent space base vectors
    //-----------------------------
    float3x3 mToTangent;
    mToTangent[0] = mul(input.vTangent, (float3x3)mToWorld);
    mToTangent[2] = mul(input.vNormal, (float3x3)mToWorld);
    mToTangent[1] = cross(mToTangent[0], mToTangent[2]);

    //Compute light and eye vectors and the projector's texture coordinates
    //-----------------------------
    float3 vToLight = normalize(pLight - pVertexWorld);
    output.vLight = mul(mToTangent, vToLight);
    float3 vToEye = normalize(pEye - pVertexWorld);
    output.vEye = mul(mToTangent, vToEye);
    output.tcpSpot = mul(input.vPosition, mSpot);
    return output;
}
sampler smplTexture : register(ps,s0);  //decal texture
sampler smplNormal  : register(ps,s1);  //normal texture
sampler smplSpot    : register(ps,s3);  //spotlight texture

// function    : main
// description : pixel shader function
// return      : PS_OUTPUT
// param:
//   PS_INPUT input  : pixel shader input - output from VS
//   float3 colDiff  : c0 - diffuse texture multiplier
//   float3 colSpec  : c1 - specular texture multiplier
//   float shininess : c2 - specular shininess
//-----------------------------
PS_OUTPUT main(const PS_INPUT input,
               uniform float3 colDiff : c0,
               uniform float3 colSpec : c1,
               uniform float shininess : c2)
{
    PS_OUTPUT output;

    // Load and normalize input vectors
    //-----------------------------
    float3 vNormal = tex2D(smplNormal, input.tcCoord).xyz;
    vNormal = normalize(2.0 * vNormal - 1.0);  //bias and normalize
    float3 vLight = normalize(input.vLight);
    float3 vEye = normalize(input.vEye);

    // Compute diffuse and specular intensities
    //-----------------------------
    float normalDotLight = dot(vNormal, vLight);
    float3 vLightReflect = 2.0*normalDotLight*vNormal - vLight;
    float eyeDotReflect = saturate(dot(vEye, vLightReflect));
    float specularIntensity = pow(eyeDotReflect, shininess);

    float4 tmpOutput = {0.0f, 0.0f, 0.0f, 1.0f};
    if (normalDotLight>0.0f)
    {
        float4 tDecal = tex2D(smplTexture, input.tcCoord);
        float4 tSpot = tex2Dproj(smplSpot, input.tcpSpot);
        float3 diffuse = normalDotLight * colDiff * tDecal;
        float3 specular = specularIntensity * colSpec * tDecal.a;
        tmpOutput.xyz = tSpot * (diffuse + specular);
    }
    output.vColor = tmpOutput;
    return output;
}
Quality Comparison
The possibility of normalization and higher precision in pixel shaders gives us far better results than with previous versions. We can compare the next images, but the difference is more visible in motion. Light intensity is more stable in motion in the pixel shader 2.0 version.
Figure 1: This image shows a vase rendered with pixel shader 1.4 on the left and pixel shader 2.0 on the right. In the small frames, you can see details of highlights.
device that is capable of accelerating version 3.0 can accelerate full-featured shaders 2.x, and because of this, it is not very useful for educational purposes.

Vertex shader 3.0 programs can be even larger than in the previous version. The shader has a minimum of 512 instructions, but due to the flow control, the device has to be capable of executing a much larger number of instructions (at least 65,536). Dynamic flow control is possible in shaders (loops can be exited depending on the value of a temporary or special predicate register). Other features include texture lookup in the vertex shader and the extension of relative indexing from constants to inputs and outputs.

Pixel shader 3.0 follows the way of vertex shaders. Dynamic and static flow control is possible, and the instruction count limits are the same. There is no limit on the texture instruction count and no limit on the dependent texture reads. New input registers are introduced: the pixel position on the screen and the face orientation register. The gradient instructions are also new; with them, the rate of change of the input registers can be inspected.

In the following examples, the capabilities of the highest shader version are used to compute four spotlights in a single pass, all done in one loop. There is no visual change from version 2.0.
def c255, 6.0f, 0.0f, 1.0f, 9.0f

// Input
//-----------------------------
dcl_position v0
dcl_normal   v1
dcl_texcoord v2
dcl_tangent  v3

// Output
//-----------------------------
dcl_position0 o0        //clip space coordinates
dcl_texcoord0 o1.xy     //texture coordinates
dcl_texcoord1 o2.xyz    //eye vector
dcl_texcoord2 o3        //Light vector 1
dcl_texcoord3 o4        //Projector texture coordinates 1
dcl_texcoord4 o5        //Light vector 2
dcl_texcoord5 o6        //Projector texture coordinates 2
dcl_texcoord6 o7        //Light vector 3
dcl_texcoord7 o8        //Projector texture coordinates 3
dcl_texcoord8 o9        //Light vector 4
dcl_texcoord9 o10       //Projector texture coordinates 4

//The following code outputs position and texture coordinates
//-----------------------------
m4x4 o0, v0, c4         //vertex clip position
mov  o1.xy, v2.xy       //Texture coordinates for color texture
m4x4 r8, v0, c0         //Transform vertex into world position

//The following code generates tangent space base vectors
//-----------------------------
m3x3 r11.xyz, v1, c0    //N to world space
mov  r11.w, v1.w
m3x3 r9.xyz, v3, c0     //T to world space
mov  r9.w, v3.w
crs  r10.xyz, r9, r11   //The cross product - binormal NxT

//Compute normalized eye vector and transform it to tangent space
//-----------------------------
add  r0, c8, -r8        //build the eye vector
nrm  r6, r0             //normalize vector
m3x3 o2.xyz, r6, r9     //eye vector to tangent space

//In the following loop we are computing normalized light vectors
//and transforming them to tangent space
//-----------------------------
mov r0.y, c255.y                    //reset constant addressing counter
loop aL, i0                         //Loop for lights
  //Index = Counter * DataSize + DataStart
  mad  r0.x, r0.y, c255.x, c255.w   //light data index
  mova a0, r0.x                     //fill address register
  add  r1, c[a0.x], -r8             //Build the light vector
  nrm  r6, r1                       //normalize vector
  m3x3 o3[aL].xyz, r6, r9           //light vector to tangent space
  m4x4 o3[aL+1], v0, c[a0.x+1]      //transform vertex with light matrix
                                    //(get projector texture coordinates)
  add  r0.y, r0.y, c255.z           //Increment const addressing counter
endloop
In this shader version, there is only one set of output registers. Previously, we had oD# for color, oFog, oPos for clip space position, oPts for point size, and oT# for texture coordinates. Now only the o# registers are available, but they can be used in any way. Therefore, the semantics of every used output register has to be declared at the beginning of the shader in the same way that it was done for inputs. Exactly one dcl_position0 always has to be declared to specify the clipping space vertex position. The following is a declaration of outputs.
dcl_position0 o0        //clip space coordinates
dcl_texcoord0 o1.xy     //texture coordinates
dcl_texcoord1 o2.xyz    //eye vector
dcl_texcoord2 o3        //Light vector 1
dcl_texcoord3 o4        //Projector texture coordinates 1
dcl_texcoord4 o5        //Light vector 2
dcl_texcoord5 o6        //Projector texture coordinates 2
dcl_texcoord6 o7        //Light vector 3
dcl_texcoord7 o8        //Projector texture coordinates 3
dcl_texcoord8 o9        //Light vector 4
dcl_texcoord9 o10       //Projector texture coordinates 4
After the vertex transformation, tangent space base creation, and eye vector computation (all discussed earlier), the new code begins.
We used two relative addressing registers in this loop: aL for output registers and a0 for input constant data. First let's discuss the loop execution scheme and its preparation. We know that starting from register o3 we produce the following output: o[3+2i], which is the light vector of light i, and o[3+2i+1], which holds the projector texture coordinates for this light. Outside the shader, we set up the i0 integer constant register with this information for the loop. The light count is stored in i0.x, the starting aL value in i0.y (= 0 because we use o3[aL], which is the same as o[3+aL]), and the aL step in i0.z (= 2).
NOTE In the DirectX 9.0 SDK documentation, the meanings of i0.x and i0.y are switched, but the loop works as described here.
The execution of loop aL, i0 can be expressed with the following pseudocode.
RemainingLoops = i0.x;
LoopCounter = i0.y;
LoopStep = i0.z;
while (RemainingLoops > 0)
{
    aL = LoopCounter;
    do_some_code();
    LoopCounter = LoopCounter + LoopStep;
    RemainingLoops = RemainingLoops - 1;
}
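The loop scheme can be traced on the CPU to see which output registers each iteration touches. This Python sketch assumes the i0 setup described in the text (four lights, start 0, step 2); `run_loop` and `body` are hypothetical helpers written for this illustration.

```python
def run_loop(i0, body):
    """Emulate 'loop aL, i0' with i0 = (count, start, step) as described."""
    remaining, counter, step = i0
    while remaining > 0:
        body(counter)        # aL = LoopCounter for this iteration
        counter += step
        remaining -= 1


written = []


def body(aL):
    # o3[aL] receives the light vector, o3[aL+1] the projector coordinates.
    written.append(("o%d" % (3 + aL), "o%d" % (3 + aL + 1)))


run_loop((4, 0, 2), body)
```

After the run, `written` lists the register pairs (o3, o4), (o5, o6), (o7, o8), and (o9, o10), which agrees with the output declarations of the shader.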
Light data is stored in constant registers; c[i] is the light position in world space, c[i+1] to c[i+4] hold the 4x4 light projector matrix, and c[i+5] holds light color data (unused here). The relative address computation that was used for constants and the entire lighting can be seen in the following pseudocode (the original assembler lines are in the comments):
//c255.x - number of registers occupied by one light
LightStructureSize = 6;
//c255.w - c[9] is first constant register with light
LightDataStart = 9;
//mov r0.y, c255.y
LightDataCounter = 0;
while (...)
{
    //mad r0.x, r0.y, c255.x, c255.w
    LightDataIndex = LightDataCounter * LightStructureSize + LightDataStart;
    //mova a0, r0.x
    a0 = LightDataIndex;
    do_light_computation();
    //add r0.y, r0.y, c255.z
    LightDataCounter = LightDataCounter + 1;
}
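For a concrete picture of the constant layout, the index arithmetic can be evaluated for each light. The sketch below uses the two values from the pseudocode (structure size 6, data start 9); the dictionary keys are labels made up for this illustration.

```python
LIGHT_STRUCTURE_SIZE = 6   # c255.x
LIGHT_DATA_START = 9       # c255.w


def light_registers(light_index):
    """Constant register numbers used by one light."""
    # mad r0.x, r0.y, c255.x, c255.w
    base = light_index * LIGHT_STRUCTURE_SIZE + LIGHT_DATA_START
    return {
        "position": base,                                   # c[i]
        "projector_matrix": list(range(base + 1, base + 5)),  # c[i+1]..c[i+4]
        "color": base + 5,                                  # c[i+5]
    }


first = light_registers(0)
last = light_registers(3)
```

The first light occupies c9 through c14 and the fourth light c27 through c32, so four lights need constants c9 to c32 in this layout.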
After this preparation, we can see that the light vector and projector texture coordinates computation consists of the following four instructions and is the same as in the previous versions, except for the use of relative addressing:
add  r1, c[a0.x], -r8           //Build the light vector
nrm  r6, r1                     //normalize vector
m3x3 o3[aL].xyz, r6, r9         //light vector to tangent space
m4x4 o3[aL+1], v0, c[a0.x+1]    //transform vertex with light matrix
                                //(get projector texture coordinates)
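dcl_texcoord0 v0.xy     //texture coordinates
dcl_texcoord1 v1.xyz    //eye vector
dcl_texcoord2 v2        //Light vector 1
dcl_texcoord3 v3        //Projector texture coordinates 1
dcl_texcoord4 v4        //Light vector 2
dcl_texcoord5 v5        //Projector texture coordinates 2
dcl_texcoord6 v6        //Light vector 3
dcl_texcoord7 v7        //Projector texture coordinates 3
dcl_texcoord8 v8        //Light vector 4
dcl_texcoord9 v9        //Projector texture coordinates 4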
//bias normal to range -1,1
//r11 = normalized normal
//load eye vector
//r9 = normalized eye vector
//In the following loop, lighting contribution will be computed for lights
//-----------------------------
mov r8, c3.b                  //reset overall diffuse output
mov r7, c3.b                  //reset overall specular output
mov r6, c3.b                  //reset iteration counter - used
                              //to get correct light color
loop aL, i0                   //This is the light loop
  mov r1.xyz, v2[aL]          //load light vector
  nrm r10, r1                 //r10 = normalized light vector
  dp3 r0, r11, r10            //r0 = (n.l)
  if_gt r0.r, c3.b            //Light only if (n.l)>0
    mul r1, r0, c3.r          //r1 = 2*(n.l)
    mad r1, r1, r11, -r10     //reflection vector - r1=2(n.l)n-l
    dp3_sat r1, r1, r9        //r1 = (r.v)
    pow r0.g, r1.r, c2.r      //r1 = (r.v)^shi

    //We have to use the following code to get the correct light color.
    //Pixel shader 3.0 cannot address constants relatively.
    mov r3, c4
    setp_gt p0, r6.r, c223
    (p0.z) mov r3, c5
    (p0.y) mov r3, c6
    (p0.x) mov r3, c7

    mov r4, v2[aL+1]          //move coordinate to temp register
    texldp r1, r4, s3         //load projector texture
    mul r1, r1, r3            //modulate with light color
    mul r1, r1, r3.a          //modulate with light intensity
    mad r8, r0.r, r1, r8      //modulate diffuse intensity with
                              //spotlight and add to overall
    mad r7, r0.g, r1, r7      //modulate specular intensity with
                              //spotlight and add to overall
  endif
  add r6, r6, c3.g            //increment iteration counter + 1
endloop

//-----------------------------
texld r1, v0, s0              //load diffuse texture
mul r8, r8, c0                //multiply overall diffuse with
                              //material diffuse color
mul r8, r8, r1                //modulate it with texture
mul r7, r7, c1                //multiply overall specular with
                              //material specular color
mad r7, r7, r1.a, r8          //modulate it with gloss map and
                              //add computed diffuse color
mov oC0, r7                   //output the color
First note the change in declaration of input registers; we have to specify a semantic. To obtain the correct results, semantics used in the pixel shader have to match the output semantic in the vertex shader. The following is a declaration of the input semantic.
dcl_texcoord0 v0.xy     //texture coordinates
dcl_texcoord1 v1.xyz    //eye vector
dcl_texcoord2 v2        //Light vector 1
dcl_texcoord3 v3        //Projector texture coordinates 1
dcl_texcoord4 v4        //Light vector 2
dcl_texcoord5 v5        //Projector texture coordinates 2
dcl_texcoord6 v6        //Light vector 3
dcl_texcoord7 v7        //Projector texture coordinates 3
dcl_texcoord8 v8        //Light vector 4
dcl_texcoord9 v9        //Projector texture coordinates 4
After this well-known vector setup (this time without the light vector), we enter the main loop. We go step by step through it. Just before the loop, we reset all registers used in the loop. Register r8 is used to accumulate diffuse intensity while register r7 is used to accumulate specular intensity. Register r6 is used to indicate the loop index and is incremented by one with the last instruction in the loop (add r6, r6, c3.g). The loop works the same way described for the vertex shader, even with the same constant values in our case. The first step in the loop is to load and normalize the current light vector.
Later in the loop we compute diffuse intensity. With flow control available, we can do if-then-else constructions, so why not use it to ignore unlit pixels (those where the dot product of n and l is less than 0) with the if_gt x, y instruction standing for if x>y? Now any further computation is done only for lit pixels. In the first four instructions after if_gt, we compute the reflection vector and specular power the same way that we did in the previous version. The following is a computation of light intensities only for lit pixels.
dp3 r0, r11, r10          //r0 = (n.l)
if_gt r0.r, c3.b          //Light only if (n.l)>0
  mul r1, r0, c3.r        //r1 = 2*(n.l)
  mad r1, r1, r11, -r10   //reflectance vector - r1=2(n.l)n-l
  dp3_sat r1, r1, r9      //r1 = (r.v)
  pow r0.g, r1.r, c2.r    //r1 = (r.v)^shi
  . . .
endif
The next block of code uses predicates to obtain the correct light color. We use the r6 register (the loop counter) to select it. This is necessary because pixel shader 3.0 cannot address constants relatively.
mov r3, c4
setp_gt p0, r6.r, c223
(p0.z) mov r3, c5
(p0.y) mov r3, c6
(p0.x) mov r3, c7
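The predicated moves act as a small selection chain. The Python sketch below emulates it under a loudly stated assumption: the text does not give the contents of c223, so the thresholds (2.5, 1.5, 0.5) in its x, y, and z components are a guess that makes counters 0 to 3 select c4 to c7 in order.

```python
# ASSUMED contents of c223.xyz; the actual values are not given in the text.
C223 = (2.5, 1.5, 0.5)


def select_light_color(counter, colors):
    """Pick one of colors = (c4, c5, c6, c7) from the loop counter r6."""
    p0 = tuple(counter > threshold for threshold in C223)  # setp_gt p0, r6.r, c223
    r3 = colors[0]          # mov r3, c4
    if p0[2]:               # (p0.z) mov r3, c5
        r3 = colors[1]
    if p0[1]:               # (p0.y) mov r3, c6
        r3 = colors[2]
    if p0[0]:               # (p0.x) mov r3, c7
        r3 = colors[3]
    return r3


chosen = [select_light_color(i, ("c4", "c5", "c6", "c7")) for i in range(4)]
```

Each later predicated move overrides the earlier one, so the order of the three moves is what makes the chain pick the right constant.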
In the last section of the loop, we sample a projector texture for this light. Intensities are then modulated with light color, light intensity, and a spotlight texture sample, and then the result is added to the overall intensity.
NOTE We had to use mov r4, v2[aL+1]; texldp r1, r4, s3 instead of a simple texldp r1, v2[aL+1], s3 because it just did not work correctly in the loop; it probably was not translated as a dependent read and always returned the same value.
After the loop comes the standard stuff; we load a decal texture and combine accumulated intensities to the final lighting for a given pixel as shown below.
texld r1, v0, s0        //load diffuse texture
mul r8, r8, c0          //multiply overall diffuse with
                        //material diffuse color
mul r8, r8, r1          //modulate it with texture
mul r7, r7, c1          //multiply overall specular with
                        //material specular color
mad r7, r7, r1.a, r8    //modulate it with gloss map and
                        //add computed diffuse part
mov oC0, r7             //output the color
- Paraboloid EM: One texture is used (named sphere map because the reflection is stored as seen on a completely reflective sphere). This texture stores information from one hemisphere around the object. Therefore, we can handle reflections only from the hemisphere facing the camera to get expected results. Without shaders, sphere maps are hard to update interactively.
- Dual paraboloid EM: Two sphere maps are used, so a complete sphere around the object can be covered.
- Cube map EM: Six textures (or one cube map) are used. Each texture represents one side of a cube around an object. A cube map is easy to update (even single sides of a cube can be updated) and easy to use. It is a native format, and it can be sampled with (x,y,z) coordinates representing vector (x,y,z) from the center of the cube.
Figure 2: 2D visualization of paraboloid EM, dual paraboloid EM, and cube map EM (left to right)
Each type of EM can be done per vertex or per pixel; it depends on where we compute the reflection vector: in the vertex shader or in the pixel shader. In this text, we use cube maps and per-pixel computation of the reflection vector, and the resulting effect is enhanced with the Fresnel term. Versions using pixel shader 1.4 and 2.0 are shown.
Mathematical Background
We all know from Phong shading how to compute the light's reflection vector r. Now we do the same, but for the view vector v:

\[ r = 2(n \cdot v)\,n - v \]

Until now, all computations in the pixel shader were in tangent space because normals were stored that way. Now we are using a cube map, and it has its own space: cube map space. Every vector (x,y,z) in that space points to a texel in one of the cube sides. We know that we need to do all of our computations in one space. The simplest way here is a transfer to cube map space; the eye vector can be transformed to this space in the vertex shader and the normal vector in the pixel shader. To do this, we need to prepare a transformation matrix from tangent space to cube space in the vertex shader. In most cases, cube map space is the same as world space (just pass the other matrix to the vertex shader and use it instead of the cube transformation matrix where applicable).
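As a small illustration of both ideas, the following Python sketch reflects a view vector about a normal and then reports which cube side the result points at. It relies on the general rule that a cube map lookup picks the side from the component with the largest magnitude; the function names and the "+X"/"-Y" style labels are choices made for this example.

```python
def reflect_view(n, v):
    """r = 2(n.v)n - v for unit vectors n and v."""
    d = sum(a * b for a, b in zip(n, v))
    return [2.0 * d * nc - vc for nc, vc in zip(n, v)]


def cube_face(direction):
    """Label of the cube side hit by a direction from the cube center."""
    axis = max(range(3), key=lambda i: abs(direction[i]))
    sign = "+" if direction[axis] >= 0.0 else "-"
    return sign + "XYZ"[axis]


normal = [0.0, 0.0, 1.0]
view = [0.6, 0.0, 0.8]
r = reflect_view(normal, view)
face = cube_face(r)
```

Here the view vector is mirrored across the normal, so only its x component changes sign, and the lookup still lands on the +Z side.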
We have a matrix that transforms vectors from object space into tangent space, formed from tangent t, binormal b, and normal n; let's call it MO_to_T.

\[
M_{O\_to\_T} =
\begin{pmatrix}
t_x & b_x & n_x & 0 \\
t_y & b_y & n_y & 0 \\
t_z & b_z & n_z & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\]
We also have a matrix that transforms vectors from object space into cube space; it is passed into the shader as a constant. Let's call it MO_to_C. With these we can compute the tangent-to-cube space transformation MT_to_C. Our two matrices give us the following two equations (note that the subscript part in a vector name indicates the space in which the vector is defined):

\[ v_{tangent} = v_{object} \cdot M_{O\_to\_T} \]
\[ v_{cube} = v_{object} \cdot M_{O\_to\_C} \]

By multiplying the first one with the inverse of MO_to_T from the right, we get:

\[ v_{tangent} \cdot M_{O\_to\_T}^{-1} = v_{object} \]

Substituting this into the second equation gives:

\[ v_{cube} = v_{tangent} \cdot M_{O\_to\_T}^{-1} \cdot M_{O\_to\_C} \]
\[ M_{T\_to\_C} = M_{O\_to\_T}^{-1} \cdot M_{O\_to\_C} \]
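The relation between the three matrices can be verified with a small numeric example. This Python sketch uses the upper-left 3x3 parts only and two hand-picked rotation matrices as stand-ins; it also assumes, as the text does, that the object-to-tangent matrix is orthogonal, so its inverse is its transpose.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def vec_mul(v, m):
    """Row vector times matrix, as in v' = v * M."""
    return [sum(v[k] * m[k][j] for k in range(3)) for j in range(3)]


# Columns are t = (0,1,0), b = (-1,0,0), n = (0,0,1): an orthonormal basis.
M_O_to_T = [[0, -1, 0],
            [1,  0, 0],
            [0,  0, 1]]
# Example object-to-cube rotation (90 degrees about the x axis).
M_O_to_C = [[1,  0, 0],
            [0,  0, 1],
            [0, -1, 0]]

# Inverse of an orthogonal matrix is its transpose.
M_T_to_C = mat_mul(transpose(M_O_to_T), M_O_to_C)

v_object = [1, 2, 3]
v_tangent = vec_mul(v_object, M_O_to_T)
v_cube_direct = vec_mul(v_object, M_O_to_C)
v_cube_via_tangent = vec_mul(v_tangent, M_T_to_C)
```

Transforming the tangent-space vector with the combined matrix gives the same cube-space vector as transforming the object-space vector directly.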
Unfortunately, to use this matrix easily in the pixel shader, we need to transpose it (m4x4, or similar macro instructions, need this). Because MO_to_T is orthogonal (it is a rotation matrix), its transpose is equal to its inverse, and we use this to get the following result:
\[
M_{T\_to\_C}^{T} = \left(M_{O\_to\_T}^{-1} \cdot M_{O\_to\_C}\right)^{T}
= M_{O\_to\_C}^{T} \cdot \left(M_{O\_to\_T}^{-1}\right)^{T}
= M_{O\_to\_C}^{T} \cdot M_{O\_to\_T}
\]

Luckily, the needed matrices are in the required form in the vertex shader. MO_to_C is passed already transposed to the shader, and MO_to_T is created inside the shader the same way that it was for per-pixel Phong shading (even in the transposed form needed for m3x3 macros).

We can enhance the resulting reflection with the Fresnel term. It describes the amount of light reflected to the viewer and the amount of light refracted when it strikes a material boundary. The maximum amount of light is reflected when the angle between the surface normal and the eye vector is near π/2, and the minimum is reflected when it is near 0. A good explanation of Fresnel reflection, its equation, and usable approximations is in [5]. We use Schlick's approximation:

\[ R(\theta) = R(0) + (1 - R(0)) \cdot (1 - \cos\theta)^5 \]
\[ R(0) = \frac{(n_1 - n_2)^2}{(n_1 + n_2)^2} \]

In the previous equations, θ is the angle between the eye vector and the half vector (between eye and light source). The half vector describes the normal of a surface that reflects the light ray directly to the eye. In our case, this vector is replaced with the surface normal. R(0) is the Fresnel reflection for zero angle θ, n1 is the index of refraction of the material from which the light comes (commonly air or vacuum), and n2 is the index of refraction of the surface material. For example, a vacuum has a refraction index ri = 1.0, air is ri = 1.000293, water is ri = 1.333333, and diamond is ri = 2.417.
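Schlick's approximation is easy to evaluate for the refraction indices listed above. The following Python sketch is a direct transcription of the two formulas; the function names are chosen for this example only.

```python
def fresnel_r0(n1, n2):
    """Reflection at zero angle: ((n1 - n2) / (n1 + n2))^2."""
    return ((n1 - n2) / (n1 + n2)) ** 2


def schlick(cos_theta, r0):
    """Schlick's approximation: R(0) + (1 - R(0)) * (1 - cos(theta))^5."""
    return r0 + (1.0 - r0) * (1.0 - cos_theta) ** 5


AIR = 1.000293
WATER = 1.333333

r0_water = fresnel_r0(AIR, WATER)
facing = schlick(1.0, r0_water)     # eye looks straight at the surface
grazing = schlick(0.0, r0_water)    # eye looks along the surface
```

For an air-to-water boundary, R(0) is roughly 0.02, so a surface viewed head-on reflects about 2 percent of the light, while at a grazing angle the approximation rises to full reflection.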
Vertex Shader
First we show a version for the DirectX 8 class of hardware, and we begin with the vertex shader. For clearer code and a simpler explanation, version 2.0 of the vertex shader is used (version 1.1 can be extracted by replacing the macro instructions nrm and crs with respective code known from older shaders). The following is a 2.0 vertex shader for environment mapping.
vs_2_0
// Constant registers
//-----------------------------
// c0-c3 - cube map space transposed (might be world space)
// c4-c7 - cube * view * proj
// c8    - Eye position (in cube space)
// c9    - Adjustment factor for cube map
//
// Input registers
//-----------------------------
dcl_position v0
dcl_normal   v1
dcl_texcoord v2
dcl_tangent  v3

// Output
//-----------------------------
// oT0 - tex coord
// oT1 - eye vector in cube space
// oT2 - 1st row of tangent-to-cube matrix
// oT3 - 2nd row of tangent-to-cube matrix
// oT4 - 3rd row of tangent-to-cube matrix
// oT5 - vertex modifier
//-----------------------------

//Output clip space position, texture coordinates
//Compute eye vector in cube space
//-----------------------------
m4x4 oPos, v0, c4         //vertex clip position
mov  oT0.xy, v2.xy        //Texture coordinates for color texture
//Transform vertex into cube map space
//Eye vector
//Normalize it
//Output it

//Create tangent space basis and matrix from tangent to cube space
//-----------------------------
mov  r9, v3               //Copy tangent, then
crs  r10.xyz, r9, v1      //do cross product to compute binormal and
mov  r11, v1              //then copy normal - matrix is in r9,r10,r11
m3x3 oT2.xyz, c0, r9      //Create transformation matrix (transposed)
m3x3 oT3.xyz, c1, r9
m3x3 oT4.xyz, c2, r9
The first notable thing is that the eye vector is computed in cube space (the eye position is passed to the vertex shader in that space, and the vertex is transformed into it as well). We pass the vector to the pixel shader without any further transformation. Next we create the object-to-tangent space transformation matrix in registers r9, r10, and r11, but this time we do not transform the normal and tangent to world space before computing the binormal. Also note that this matrix is created in transposed form, which is needed for the m3x3 macro. With the last three m3x3 instructions, we create the transposed MT_to_C matrix, as described in the previous section, and output it to the pixel shader.

The very last instruction requires some explanation. A cube map lookup uses a vector (x,y,z) from the center of the cube, so two points with the same vector (in our case, the reflection vector) produce the same lookup result. If we have a group of points with the same normal (a plane is a typical example), their reflection vectors will be almost the same, and the reflection on the plane will be nearly a single color. To prevent this, we have to modify the reflection vector so that it points to the correct place. Take a look at the following figure:
This figure shows that if we add the vector from the cube center to the vertex position (all in cube space) to the reflection vector, the result points to the correct place and starts at the cube center. In real situations, the reflection vector is not guaranteed to end exactly at the cube boundary, as shown above, so the resulting vector will likely point to a slightly different place, but the results are good enough in practice. The last shader instruction prepares this correction vector (remember that the vertex position in cube map space is also a vector from the center to this point) by multiplying it by a scalar constant. This is needed if the cube space transformation is replaced with a world space transformation (a very common situation) and the (x,y,z) ranges in this space are greater than [-1, 1] (the valid range in the cube map). We compute this tweak ratio (which can be 1/greatest_coordinate_in_world_space) outside the vertex shader to bring the vertex into the cube space range. Of course, the simple mul can be replaced with the more sophisticated mad (for additional center adjustments).
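The correction can be sketched on the CPU as follows. This is an illustrative Python sketch under stated assumptions: the helper names adjustment_factor and adjusted_lookup are hypothetical, and the scene is assumed to be centered on the cube map origin so that a single scale factor is enough.

```python
def adjustment_factor(world_positions):
    """Tweak ratio 1 / greatest absolute coordinate in world space.

    Scaling positions by this factor brings every coordinate into [-1, 1].
    """
    greatest = max(abs(c) for p in world_positions for c in p)
    return 1.0 / greatest


def adjusted_lookup(reflection, vertex_pos, factor):
    """Reflection vector shifted by the scaled center-to-vertex vector."""
    return tuple(r + factor * p for r, p in zip(reflection, vertex_pos))


# Two vertices of a floor plane share the same reflection vector ...
positions = [(10.0, 0.0, -20.0), (5.0, 0.0, 2.0)]
factor = adjustment_factor(positions)
reflection = (0.0, 1.0, 0.0)

# ... but after the shift they sample different places in the cube map.
lookup_a = adjusted_lookup(reflection, positions[0], factor)
lookup_b = adjusted_lookup(reflection, positions[1], factor)
```

The factor plays the role of the constant in c9; the per-vertex multiplication is what the last vertex shader instruction performs.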
t2 //1st row of tangent-to-cube matrix
t3 //2nd row of tangent-to-cube matrix
t4 //3rd row of tangent-to-cube matrix
t5 //vector shift
//multiply normal with transform matrix
dp3 r3.r, r3, r1_bx2 //transform normal to cube space
dp3 r3.g, r4, r1_bx2
dp3 r3.b, r5, r1_bx2
//compute eye reflection vector
dp3 r1.rgb, r3, r2 //r1 = dot(normal, eye)
mad r4.rgb, r1_x2, r3, -r2 //reflection vector r4 = 2(n.e)n-e
add r4.rgb, r4, r0 //Shift it
phase
texld r0, t0 //diffuse texture(n)
texld r2, r4 //cube map lookup
texld r3, r1 //Fresnel lookup
mul r0.rgb, r0, r1.r //to simulate diffuse lighting
lrp_sat r4, c0.r, c0.g, r3 //prepare Fresnel value (with R(0))
//mul_sat r4, r4, r0.a //modulate with gloss ratio
lrp r0.rgb, r4.a, r2, r0 //compute final color
In the second phase, we read the environment reflection from the cube texture at the coordinates given by the reflection vector computed before. Then we use the dot product of the normal and eye vector (computed in the first phase) for a lookup into the texture holding one part of the Fresnel reflection approximation, $(1 - \cos\theta)^5$. We could compute it in the shader, but the limited precision would produce even larger errors, and a lookup into a 1D texture is very fast. We use the computed dot product one more time to simulate diffuse light so that the scene won't look flat (this is not needed in a game, where the lighting is done in a separate pass).
The lrp_sat instruction (saturated linear interpolation; lrp dest, src0, src1, src2 means dest = src0*src1 + (1-src0)*src2) is used to finalize the Fresnel equation (after substituting the constants, we get R = R(0)*1.0 + (1-R(0))*LookupValue). Then comes modulation with the gloss ratio (disabled in this case) and the final interpolation between the diffuse color and the reflection, depending on the value of the Fresnel function.
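The lookup-plus-lrp step can be emulated on the CPU. The sketch below is an assumption-laden illustration in Python: the table size of 256 entries and nearest-texel sampling are choices made here, not taken from the original demo, and the function names are hypothetical.

```python
def lrp(src0, src1, src2):
    """Shader-style linear interpolation: src0*src1 + (1 - src0)*src2."""
    return src0 * src1 + (1.0 - src0) * src2


def fresnel_lookup_table(size=256):
    """1D texture holding (1 - cos(theta))^5 for cos(theta) in [0, 1]."""
    return [(1.0 - i / (size - 1)) ** 5 for i in range(size)]


def fresnel_from_table(n_dot_v, r0, table):
    """Emulates the texture lookup followed by lrp_sat with constants (R(0), 1.0)."""
    clamped = min(max(n_dot_v, 0.0), 1.0)
    index = round(clamped * (len(table) - 1))  # nearest texel
    value = lrp(r0, 1.0, table[index])         # R(0)*1.0 + (1 - R(0))*lookup
    return min(max(value, 0.0), 1.0)           # the _sat modifier


table = fresnel_lookup_table()
facing = fresnel_from_table(1.0, 0.02, table)
grazing = fresnel_from_table(0.0, 0.02, table)
```

With R(0) = 0.02 the result rises from 0.02 for a surface facing the viewer to 1.0 at a grazing angle.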
Figure 5: This image shows a reflection of materials with various indexes of refraction: water, flint glass, diamond, and full reflection. Note how the amount of reflection increases with the angle.
// Used input texture samplers
//-----------------------------
dcl_2d s0 //diffuse texture (gloss in alpha)
dcl_2d s1 //normal texture
dcl_cube s2 //cube map texture
// Output
//-----------------------------
// oC0 - output color
//
// Set up needed vectors - load and normalize
//-----------------------------
texld r0, t0, s1 //load normal
mad r1, r0, c1.r, -c1.g //bias normal to range -1,1
m3x3 r0.xyz, r1, t2 //transform to cube space
nrm r11, r0 //r11 = normalized normal
mov r1.xyz, t1
nrm r10, r1 //r10 = normalized eye vector
// Compute eye reflection vector
//-----------------------------
dp3 r9.r, r11, r10 //r9 = dot(normal, eye)
mul r8.r, r9.r, c1.r //r8 = 2*(n.v)
mad r8.rgb, r8.r, r11, -r10 //reflection vector - r8=2(n.v)n-v
add r8.rgb, r8, t5 //Shift it
// Compute Fresnel term (Schlick's approximation) F=IR+(1-IR)*(1-(n.v))^5
//-----------------------------
sub r1.r, c1.g, r9.r //r1 = 1 - n.v
pow r0.r, r1.r, c1.b //r0 = (1 - n.v)^5
lrp r7.rgb, c0.r, c0.g, r0.r //final F = IR*1 + (1 - IR)*r0
// Texture lookups and final modulations
//-----------------------------
texld r0, t0, s0 //diffuse texture
texld r1, r8, s2 //cube map lookup
mul r0.rgb, r0, r9.r //to simulate diffuse lighting (n.v)
mul r7.rgb, r7, r0.a //modulate Fresnel term with gloss
lrp r1.rgb, r7.r, r1, r0 //compute final color
mov oC0, r1 //output the color
The new pixel shader produces results almost identical to the older one. A direct picture-to-picture comparison shows a very small shift in the position of the reflection and a slightly stronger reflection at glancing angles. Both are due to the per-pixel normalization of vectors, and neither is noticeable in the moving environment of a game.
HLSL Version
These programs are translations of pixel and vertex shaders 2.0 into HLSL. The following HLSL vertex shader is for environment mapping.
// Used input structure
//-----------------------------
struct VS_INPUT
{
    float4 vPosition : POSITION;
    float3 vNormal : NORMAL;
    float2 tcCoord : TEXCOORD;
    float3 vTangent : TANGENT;
};

// Used output structure
//-----------------------------
struct VS_OUTPUT
{
    float4 vClipPos : POSITION; //Clipping space position
    float2 tcCoord : TEXCOORD0; //texture coordinates
    float3 vEye : TEXCOORD1; //eye vector
    float3x3 mToWorld: TEXCOORD2; //from tangent to world space
    float3 vAdjust : TEXCOORD5; //vector adjustment
};

// Constant registers
//-----------------------------
float4x4 mToCube : register(c0); //cube map space transposed
float4x4 mToClip : register(c4); //world * view * proj
float4 pEye : register(c8); //Eye position (cube space)
// function : main
// description : vertex shader function
// return : VS_OUTPUT
// param:
// VS_INPUT input : vertex shader input
//-----------------------------
VS_OUTPUT main(const VS_INPUT input)
{
    VS_OUTPUT output;
    //Following code outputs position and texture coordinates
    //-----------------------------
    output.vClipPos = mul(input.vPosition, mToClip);
    output.tcCoord = input.tcCoord;
    float4 pVertexCube = mul(input.vPosition, mToCube); //To cube space
    output.vEye = normalize(pEye - (float3)pVertexCube);
    //Create tangent space basis and matrix from tangent to cube space
    //-----------------------------
    float3x3 mToTangent;
    mToTangent[0] = input.vTangent;
    mToTangent[2] = input.vNormal;
    mToTangent[1] = cross(mToTangent[0], mToTangent[2]); //binormal NxT
    output.mToWorld = mul(mToTangent, mToCube);
    output.vAdjust = pVertexCube * Adjustment; //vector adjustment
    return output;
}
//texture coordinates
//eye vector
//from tangent to world space 1
//perspective spotlight coordinates
//render target 0
// Used input texture samplers
//-----------------------------
sampler smplTexture : register(ps,s0); //decal texture
sampler smplNormal : register(ps,s1); //normal texture
sampler smplCube : register(ps,s2); //cube map texture
// function : main
// description : pixel shader function
// return : PS_OUTPUT
// param:
// PS_INPUT input : pixel shader input - output from VS
// float refindex: c0 - R(0) for Fresnel term
//-----------------------------
PS_OUTPUT main(const PS_INPUT input, uniform float refindex : register(c0))
{
    PS_OUTPUT output;
    //load and normalize
    //-----------------------------
    float3 vNormal = tex2D(smplNormal, input.tcCoord).xyz;
    vNormal = 2.0 * vNormal - 1.0; //bias to [-1...1]
    vNormal = mul(vNormal, input.mToWorld); //to world space
    vNormal = normalize(vNormal);
    float3 vEye = normalize(input.vEye);
    //compute adjusted eye reflection vector
    //-----------------------------
    float eDotN = dot(vEye, vNormal);
    float3 vEyeReflected = 2 * eDotN * vNormal - vEye + input.vAdjust;
    float4 cube = texCUBE(smplCube, vEyeReflected);
    float4 color = tex2D(smplTexture, input.tcCoord);
    float Fresnel = lerp(pow(1 - dot(vNormal, vEye), 5), 1, refindex);
    Fresnel = Fresnel * color.a; //reflection only on shiny parts
    float4 diffuse = color * dot(vNormal, vEye); //diffuse simulation
    output.vColor = lerp(diffuse, cube, Fresnel);
    return output;
}
Spherical Coordinates
Since lighting equations are more about directions (light and view vectors, normal vectors, etc.) than positions, it is often better to describe a vector not in Cartesian coordinates $v = (v_x, v_y, v_z)$ but with a pair of angles, $\theta$ (polar or elevation) and $\phi$ (azimuth), plus the length of the vector. The polar angle is the angle between the vector and one base vector, and the azimuth angle is the angle between the vector projected onto the plane defined by the remaining two base vectors and one of these two vectors. This suits the vectors of the lighting equation well: they are normalized, so we can simply ignore the length parameter. In the case of lighting, the base vectors will almost always be n, t, and b; polar angles are measured with respect to the normal n, and azimuth angles are relative to the tangent t. The following figure shows this situation.
The relationship between Cartesian $(v_x, v_y, v_z)$ and spherical $(\theta_v, \phi_v)$ coordinates for a normalized vector v is shown in the following equations:

$$v_x = \cos(\phi_v) \sin(\theta_v)$$

$$v_y = \sin(\phi_v) \sin(\theta_v)$$

$$v_z = \cos(\theta_v)$$

$$\phi_v = \arctan\left(\frac{v_y}{v_x}\right)$$

$$\theta_v = \arctan\left(\frac{\sqrt{v_x^2 + v_y^2}}{v_z}\right) = \arccos\left(\frac{v_z}{\sqrt{v_x^2 + v_y^2 + v_z^2}}\right) = \arccos(v_z)$$
It is obvious that the dot product between the normal and light (or eye) vector is nothing more than a cosine of a polar angle.
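The conversion can be sketched in a few lines of Python. This is an illustration only: it assumes tangent space with t, b, and n as the x, y, and z axes, the function names are hypothetical, and atan2 is used in place of a plain arctangent so that the azimuth lands in the correct quadrant.

```python
import math


def to_cartesian(theta, phi):
    """Unit vector from polar angle theta (from n) and azimuth phi (from t)."""
    return (math.cos(phi) * math.sin(theta),
            math.sin(phi) * math.sin(theta),
            math.cos(theta))


def to_spherical(v):
    """Polar and azimuth angles of a normalized tangent-space vector."""
    x, y, z = v
    theta = math.acos(max(-1.0, min(1.0, z)))  # clamp guards rounding errors
    phi = math.atan2(y, x)
    return theta, phi


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


normal = (0.0, 0.0, 1.0)
v = to_cartesian(0.7, 2.0)
theta_v, phi_v = to_spherical(v)
n_dot_v = dot(normal, v)  # equals cos(theta_v)
```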
Roughness of a Surface
Both of the models presented use a micro-facet model to simulate the structure (roughness) of a surface: every piece of the surface is composed of tiny facets. The roughness model is called a v-cavities model because the structure of the surface is modeled with cavities in the shape of a V. It was introduced by Torrance and Sparrow in [6]. While the facets in each model have different properties, they have something in common: the resulting intensity of a surface piece depends on the sum of the facet intensities. The importance of this approach is shown in Figure 7 (taken from the later Oren-Nayar model), which shows that a surface composed of totally Lambertian micro-facets (whose brightness is independent of the position of the viewer) is not Lambertian when viewed from a distance where several micro-facets are covered by one pixel.
Figure 7: The brightness of a pixel that covers a surface patch with Lambertian micro-facets depends on the viewer's position. On the left, the viewer sees brighter facets and the resulting pixel is brighter. On the right, the pixel is darker.
Due to the complexity of the computation, actual rendering does not use real facets. Instead, models use a probability distribution function that predicts the approximate number of facets with a specific normal (heading the specific way). These functions are described later with a respective model.
The equation for computing the masking-shadowing term (or GAF, for geometric attenuation factor) is:

$$GAF = \min\left(1, \frac{2 (n \cdot h)(n \cdot v)}{(v \cdot h)}, \frac{2 (n \cdot h)(n \cdot l)}{(v \cdot h)}\right)$$

The h vector has the same meaning that it had previously in the Fresnel term equation. GAF is used as an attenuator for the intensity of a fully lit facet or surface patch.
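A direct CPU transcription of this term might look like the Python sketch below. The helper names are hypothetical, and the half vector is assumed to be the normalized sum of the light and view vectors.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)


def gaf(n, l, v):
    """Geometric attenuation factor min(1, 2(n.h)(n.v)/(v.h), 2(n.h)(n.l)/(v.h))."""
    h = normalize(tuple(a + b for a, b in zip(l, v)))  # half vector
    n_dot_h = dot(n, h)
    v_dot_h = dot(v, h)
    masking = 2.0 * n_dot_h * dot(n, v) / v_dot_h
    shadowing = 2.0 * n_dot_h * dot(n, l) / v_dot_h
    return min(1.0, masking, shadowing)


n = (0.0, 0.0, 1.0)
overhead = gaf(n, n, n)               # light and viewer along the normal
low = normalize((1.0, 0.0, 0.1))
grazing = gaf(n, low, low)            # light and viewer near the horizon
```

No attenuation occurs when light and viewer are overhead, while almost everything is masked or shadowed at grazing angles.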
Figure 9: Detail of a rough surface. On the left, the viewer sees more of the brighter facets and the resulting pixel is brighter. On the right, the pixel will be darker.
The O-N model assumes that the surface consists of tiny micro-facets, each perfectly Lambertian. This model takes into account masking and shadowing of facets and also adds an interreflection factor (light bouncing between adjacent facets). The model uses a Gaussian distribution function $D_{Gauss}$ with zero mean and standard deviation $\sigma$ to describe the roughness of the surface. We use $\sigma$ as the roughness parameter. If it is zero, all facet normals are aligned with the surface normal. The greater $\sigma$ is, the deeper the cavities are. Here you can see the Gaussian distribution function, where C is the normalization constant and $\theta$ is the polar angle of the facet normal with respect to the surface patch normal:

$$D_{Gauss} = C \cdot e^{-\frac{\theta^2}{\sigma^2}}$$
The overall brightness of a surface patch is the integral of the intensities of all its facets: some masked by others, some shadowed, and some lit by reflection from other facets. To compute this integral in real time, we have to wait for a new generation of hardware. The authors of this model (knowing its complexity) simplified it into a single equation (interested readers should look into the original papers), which is unfortunately still far from usable in game-oriented computer graphics applications. The authors then simplified it even further by ignoring the interreflection factor and terms that contribute only a little to the final intensity. Current hardware is powerful enough to evaluate this version in the pixel shader for real-time lighting and shading. Here is the simplified O-N equation:

$$I_{O-N} = \cos(\theta_L) \left(A + B \max(0, \cos(\phi_V - \phi_L)) \sin(\alpha) \tan(\beta)\right)$$

$$A = 1 - 0.5 \frac{\sigma^2}{\sigma^2 + 0.33}$$

$$B = 0.45 \frac{\sigma^2}{\sigma^2 + 0.09}$$

$$\alpha = \max(\theta_L, \theta_V)$$

$$\beta = \min(\theta_L, \theta_V)$$
In the previous equation, $\theta_L$ is the polar angle of the light vector, and $\theta_V$ is the polar angle of the view vector, both measured from the surface normal. $\phi_L$ and $\phi_V$ are the azimuth angles of the light and view vectors, respectively, measured from the tangent.
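As a reference for checking shader output, the simplified equation can be evaluated directly from the four angles. The Python sketch below is illustrative only; it assumes that the larger polar angle feeds the sine and the smaller one the tangent (the convention of the original Oren-Nayar paper and of the shader comment in this section), and the function name is hypothetical.

```python
import math


def oren_nayar(theta_l, phi_l, theta_v, phi_v, sigma):
    """Simplified Oren-Nayar intensity for roughness sigma (angles in radians)."""
    s2 = sigma * sigma
    a = 1.0 - 0.5 * s2 / (s2 + 0.33)
    b = 0.45 * s2 / (s2 + 0.09)
    alpha = max(theta_l, theta_v)  # assumed: larger angle goes to sin
    beta = min(theta_l, theta_v)   # assumed: smaller angle goes to tan
    cx = max(0.0, math.cos(phi_v - phi_l))
    return math.cos(theta_l) * (a + b * cx * math.sin(alpha) * math.tan(beta))


# Roughness 0 reduces the model to the Lambertian cosine term.
smooth = oren_nayar(0.5, 0.0, 0.5, 0.0, 0.0)

# A rough surface is brighter when viewed from the light's side
# than from the opposite side.
same_side = oren_nayar(1.0, 0.0, 1.0, 0.0, 0.5)
opposite_side = oren_nayar(1.0, 0.0, 1.0, math.pi, 0.5)
```

A and B depend only on the roughness, so in practice they can be computed once per material on the CPU.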
Shaders
In this section we implement the Oren-Nayar lighting model with shaders version 2.0. We show and describe only the pixel shader because the vertex shader is identical to the one shown in the Phong shading section at the beginning of this article. The following is the pixel shader for Oren-Nayar lighting.
ps_2_0
// Constant registers
//-----------------------------
// c0 - roughness (R) (should be redefined in material)
def c31, 1.0f, 2.0f, 0.5f, 0.33f //useful constants
def c30, 0.45f, 0.09f, 0.0f, 0.0f //useful constants
def c29, 0.0f, 1.0f, 2.0f, 3.0f //useful constants
// Used input registers
//-----------------------------
dcl t0.xy //texture coordinates
dcl t1.xyz //light vector
dcl t2.xyz //eye vector
// Used input texture samplers
//-----------------------------
dcl_2d s0 //diffuse texture (gloss in alpha)
dcl_2d s1 //normal texture
dcl_2d s2 //lookup
// Output
//-----------------------------
// oC0 - output color
// Load and normalize input vectors
//-----------------------------
texld r0, t0, s1 //load normal
mad r1, r0, c31.g, -c31.r //bias normal to range -1,1
nrm r11, r1 //r11 = normalized normal
mov r1.xyz, t1
nrm r10, r1 //r10 = normalized light vector
mov r1.xyz, t2
nrm r9, r1 //r9 = normalized eye vector
// A = 1 - 0.5 * R^2 / (R^2 + 0.33)
//-----------------------------
mul r0, c0.r, c0.r //R^2
add r1, r0, c31.a //R^2 + 0.33
rcp r1, r1.r //1 / (R^2 + 0.33)
mul r0, r1.r, r0 //R^2 / (R^2 + 0.33)
mad r8, r0.r,-c31.b, c31.r //r8 = 1 - 0.5 * R^2 / (R^2 + 0.33)
// B = 0.45 * R^2 / (R^2 + 0.09)
//-----------------------------
mul r0, c0.r, c0.r //R^2
add r1, r0, c30.g //R^2 + 0.09
rcp r1, r1.r //1 / (R^2 + 0.09)
mul r0, r1.r, r0 //R^2 / (R^2 + 0.09)
mul r7, r0.r, c30.r //r7 = 0.45 * R^2 / (R^2 + 0.09)
// CX = Max(0, cos (l',v'))
//-----------------------------
dp3 r1, r10, r11 //these four instructions are projecting the
mul r1, r11, r1 //light vector to the plane defined by T and B
sub r1, r10, r1 //equation is : l' = normalize(l - n * (n.l))
nrm r0, r1 //in our case r0 = normalize(r10 - r11 * r1)
dp3 r2, r9, r11 //these four instructions are projecting the
mul r2, r11, r2 //eye vector to the plane defined by T and B
sub r2, r9, r2 //equation is : v' = normalize(v - n * (n.v))
nrm r1, r2 //in our case r1 = normalize(r9 - r11 * r2)
dp3 r6, r0, r1 //(l'.v') = (r0.r1)
max r6, r6, c29.r //only positive values
// DX = texture lookup for sin(a)*tan(b); a=max(theta_r,theta_i); b=min(theta_r,theta_i)
//-----------------------------
dp3 r1.x, r10, r11 //n.l
dp3 r1.y, r9, r11 //n.v
texld r0, r1, s2 //look up
mov r5, r0.r
// complete it - A + B*CX*DX
//-----------------------------
mul r0, r5, r6 //CX*DX
mul r0, r0, r7 //B*CX*DX
add r4, r0, r8 //A + B*CX*DX
// Load texture, compute diffuse part and combine it all to output
//-----------------------------
dp3_sat r1, r11, r10 //n.l
texld r0, t0, s0 //load diffuse and specular texture
mul r0, r0, r1 //compute diffuse texture
mul r0, r0, r4 //modulate by A + B*CX*DX
mov oC0, r0
Right after setup and renormalization of the vectors, we compute the A and B parameters of the equation. Because these depend only on the roughness parameter and are constant for the shader, in a real game they should be computed outside the shader to gain some speed. Then we compute $\max(0, \cos(\phi_V - \phi_L))$. To do this, we project the eye and light vectors onto the plane described by t and b using the following equation:

$$v_{projected} = v - n (n \cdot v)$$

This can also be done by zeroing their z component, but we would lose the bump map in that case. After renormalizing both modified vectors (v' and l'), we can replace $\cos(\phi_V - \phi_L)$ with the dot product $(v' \cdot l')$. To see why this works, take a look at Figure 6 in the section titled Spherical Coordinates.

After this, we compute the $\sin(\alpha) \tan(\beta)$ part. We know that we can easily compute the cosines of the polar angles (dot products with the normal). Sine or tangent is harder to compute within the constraints of shaders 2.0, because we would either need arccos (we do not have enough instruction slots) or need to express sin and tan through cosines. To solve this, we move the entire computation into a 2D lookup texture, where the x coordinate is the cosine of $\theta_L$, that is $(n \cdot l)$, and y is the cosine of $\theta_V$, that is $(n \cdot v)$. Textures accept only coordinates from the range [0, 1], but this is not a problem because $\theta_L$ and $\theta_V$ are both positive and less than $\pi/2$. ($\theta_V$ is always less than $\pi/2$ because we render only pixels facing toward the camera. Cases where $\theta_L$ is greater than $\pi/2$ can be ignored, because then the $\cos(\theta_L)$ at the beginning of the equation resets the whole computation to zero.) So both dot products will be in the range [0, 1]. When the texture is generated, the function can use arccos to restore the angles from the inputs. Then the maximum and minimum are chosen and $\sin(\alpha) \tan(\beta)$ is computed. Because the result of this function can fall outside the range [0, 1], we use a new floating-point texture format: only one channel (red) with full 32-bit float precision.
Figure 11: Lookup texture for sin(a)*tan(b). The result is divided by 9.0 to show greater range.
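Generating such a lookup texture on the CPU could be sketched as follows. This Python sketch is illustrative rather than the original tool: the 64x64 resolution, sampling at texel centers (which avoids the singular point where the cosine is exactly 0), and the choice of the larger angle for the sine and the smaller one for the tangent are all assumptions made here.

```python
import math


def sin_tan_lookup(size=64):
    """2D table of sin(a)*tan(b); column index maps to n.l, row index to n.v."""
    table = []
    for row in range(size):
        theta_v = math.acos((row + 0.5) / size)      # restore angle from n.v
        line = []
        for col in range(size):
            theta_l = math.acos((col + 0.5) / size)  # restore angle from n.l
            a = max(theta_l, theta_v)
            b = min(theta_l, theta_v)
            line.append(math.sin(a) * math.tan(b))
        table.append(line)
    return table


table = sin_tan_lookup()
largest = max(max(line) for line in table)
facing = table[-1][-1]  # light and viewer both almost along the normal
```

The largest entry is far above 1, which is why an 8-bit texture format is not sufficient and a floating-point format is used instead.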
Later in the shader we combine these contributions and modulate the result with the standard Lambertian diffuse term and the color texture. Due to its complexity, this shader is hard to port to pixel shader version 1.4. However, it could be done if we replaced the azimuth angles with the respective polar angles everywhere in the computation. Then we could use one 3D lookup texture with three parameters ($n \cdot l$, $n \cdot v$, roughness) and compute all the lighting with one dependent texture lookup.
Figure 12: A vase with various roughness values: 0.0, 0.3, 0.6, and 1.0 (left to right). If roughness is 0, the model is identical to Lambertian. Note how the vase gets flatter with increased roughness.
HLSL Version
The following HLSL vertex shader is for Oren-Nayar lighting.
// Used input structure
//-----------------------------
struct VS_INPUT
{
    float4 vPosition : POSITION;
    float3 vNormal : NORMAL;
    float2 tcCoord : TEXCOORD;
    float3 vTangent : TANGENT;
};

// Used output structure
//-----------------------------
struct VS_OUTPUT
{
    float4 vClipPos: POSITION; //Clipping space position
    float2 tcCoord : TEXCOORD0; //texture coordinates
    float3 vLight : TEXCOORD1; //light vector
    float3 vEye : TEXCOORD2; //eye vector
};

// Constant registers
//-----------------------------
float4x4 mToWorld : register(c0); //world space transposed
float4x4 mToClip : register(c4); //world * view * proj
float4 pLight : register(c8); //Light position (world space)
float4 pEye : register(c9); //Eye position (world space)

// function : main
// description : vertex shader function
// return : VS_OUTPUT
// param:
// VS_INPUT input : vertex shader input
//-----------------------------
VS_OUTPUT main(const VS_INPUT input)
{
    VS_OUTPUT output;
    //The following code outputs position and texture coordinates
    //-----------------------------
    output.vClipPos = mul(input.vPosition, mToClip);
    output.tcCoord = input.tcCoord;
    float4 pVertexWorld = mul(input.vPosition, mToWorld);
    //The following code generates tangent space base vectors
    //-----------------------------
    float3x3 mToTangent;
    mToTangent[0] = mul(input.vTangent, (float3x3)mToWorld);
    mToTangent[2] = mul(input.vNormal, (float3x3)mToWorld);
    mToTangent[1] = cross(mToTangent[0], mToTangent[2]);
    //Compute light and eye vectors
    //-----------------------------
    float3 vToLight = normalize(pLight - pVertexWorld);
    output.vLight = mul(mToTangent, vToLight);
    float3 vToEye = normalize(pEye - pVertexWorld);
    output.vEye = mul(mToTangent, vToEye);
    return output;
}
    float A = 1.0f - 0.5f * roughness2 / (roughness2 + 0.33f);
    float B = 0.45f * roughness2 / (roughness2 + 0.09f);
    // CX = Max(0, cos (l',v'))
    //-----------------------------
    float normalDotLight = dot(vNormal, vLight);
    float3 vLightProjected = normalize(vLight - vNormal*normalDotLight);
    float normalDotEye = dot(vNormal, vEye);
    float3 vEyeProjected = normalize(vEye - vNormal*normalDotEye);
    float CX = saturate(dot(vLightProjected, vEyeProjected));
    // DX = texture lookup for sin*tan
    //-----------------------------
    float2 tcLookup = {normalDotLight, normalDotEye};
    float DX = tex2D(smplLookUp, tcLookup);
    // complete it - (n.l)*texture*(A + B*CX*DX)
    //-----------------------------
    output.vColor = saturate(normalDotLight) *
        tex2D(smplTexture, input.tcCoord) * (A + B*CX*DX);
    return output;
}
Cook-Torrance Model
This model was published in 1981 [9] and is based on the Torrance-Sparrow model from 1967 [6]. The Cook-Torrance (C-T) model is often used to evaluate specular highlights of metals and plastics. It is a physically based method and surpasses Phong's model because it was developed using measured data from real materials and uses physically measurable factors, such as energy and wavelength. The Cook-Torrance model uses:
- A micro-facet model for surface roughness
- Fresnel's equation to compute the amount of reflection and the color shift of the highlight
- The geometric attenuation factor for micro-facet self-shadowing and masking
The specular highlight usually has the color of the material and not the color of the light. The Fresnel equation predicts a color shift of the specular component at glancing angles. Some types of materials (painted objects and plastics) have specular and diffuse components that do not have the same color.
We use Beckman's distribution function $D_{Beckman}$ for surface roughness. In the following equation, m stands for the surface roughness and $\theta$ is the angle between the normal and the half vector ($\cos\theta = (n \cdot h)$).

$$D_{Beckman} = \frac{1}{m^2 \cos^4\theta} e^{-\frac{\tan^2\theta}{m^2}}$$
DBeckman can be changed into a more shader-friendly form with a well-known trigonometric identity:

$$\tan^2\theta = \frac{\sin^2\theta}{\cos^2\theta} = \frac{1 - \cos^2\theta}{\cos^2\theta} = \frac{1 - (n \cdot h)^2}{(n \cdot h)^2}$$
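The equivalence of the angle form and the dot-product form is easy to confirm numerically. The Python sketch below is only an illustration; the function names are hypothetical, and the intermediate names x and z mirror the comments used in the pixel shader in this section.

```python
import math


def beckman_from_angle(theta, m):
    """D = 1 / (m^2 cos^4(theta)) * exp(-tan^2(theta) / m^2)."""
    return math.exp(-(math.tan(theta) / m) ** 2) / (m * m * math.cos(theta) ** 4)


def beckman_from_dot(n_dot_h, m):
    """Same distribution using only (n.h), as a shader would evaluate it."""
    x = n_dot_h * n_dot_h  # cos^2(theta)
    z = m * m * x          # m^2 * cos^2(theta)
    return math.exp(-(1.0 - x) / z) / (z * x)


angle_form = beckman_from_angle(0.3, 0.4)
dot_form = beckman_from_dot(math.cos(0.3), 0.4)
peak = beckman_from_dot(1.0, 0.5)  # facet aligned with the half vector
```

The dot-product form needs no trigonometric functions at all, which is what makes it practical inside a pixel shader.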
Because we are using Schlick's approximation of Fresnel's equation, which works for non-polarized light, we have to ignore color shifts in specular highlights. The following equation represents the specular component of the Cook-Torrance lighting model. F stands for the Fresnel term, GAF for the geometric attenuation factor, and D for the distribution function.

$$I_{C-T} = \frac{F \cdot GAF \cdot D}{\pi (n \cdot l)(n \cdot v)}$$
Note that the division by $(n \cdot l)$ is left out of our shader because the previous equation has the form used for a BRDF, where the computed intensity (of every lighting model used) is multiplied by $(n \cdot l)$. In our case, there is no need to multiply and divide by the same term.
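Putting the three factors together gives a compact CPU reference for the specular term. This Python sketch is an illustration under stated assumptions: the function names are hypothetical, the Fresnel term is evaluated with $(n \cdot v)$ as the shaders in this section do, and the second function shows the shader variant that skips the division by $(n \cdot l)$.

```python
import math


def beckman(n_dot_h, m):
    x = n_dot_h * n_dot_h
    z = m * m * x
    return math.exp(-(1.0 - x) / z) / (z * x)


def cook_torrance_brdf(n_dot_l, n_dot_v, n_dot_h, v_dot_h, m, r0):
    """Specular term F * GAF * D / (pi * (n.l) * (n.v))."""
    d = beckman(n_dot_h, m)
    f = r0 + (1.0 - r0) * (1.0 - n_dot_v) ** 5
    g = min(1.0,
            2.0 * n_dot_h * n_dot_v / v_dot_h,
            2.0 * n_dot_h * n_dot_l / v_dot_h)
    return f * g * d / (math.pi * n_dot_l * n_dot_v)


def cook_torrance_shader_form(n_dot_l, n_dot_v, n_dot_h, v_dot_h, m, r0):
    """BRDF times (n.l): the division and multiplication by (n.l) cancel."""
    return n_dot_l * cook_torrance_brdf(n_dot_l, n_dot_v, n_dot_h, v_dot_h, m, r0)


head_on = cook_torrance_brdf(1.0, 1.0, 1.0, 1.0, 0.5, 0.04)
tilted = cook_torrance_shader_form(0.8, 0.9, 0.95, 0.85, 0.5, 0.04)
```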
Figure 13: An illustration of the Cook-Torrance equation. The intensity range is from 0 (black) to 3 (white) to show how D contributes to the final lighting.
We attempt to port this model to Direct3D first with 2.0 shaders and then with 1.4 shaders.
Shaders 2.0
The vertex shader for Direct3D 9 is almost the same as the shader for Phong shading, except we are also computing the half vector with the last instructions. The following is the vertex shader 2.0 for Cook-Torrance lighting.
vs_2_0
// Constant registers
//-----------------------------
// c0-c3 - world space transposed
// c4-c7 - world * view * proj
// c8 - Light position (in world space)
// c9 - Eye position (in world space)
// Input registers
//-----------------------------
dcl_position v0
dcl_normal v1
dcl_texcoord v2
dcl_tangent v3
// Fixed temporary registers
//-----------------------------
// r9, r10, r11 - tangent space basis
// r8 - vertex world position
// Output
//-----------------------------
// oT0 - texture coordinates
// oT1 - Light vector (in tangent space)
// oT2 - eye vector (in tangent space)
// oT3 - half vector (in tangent space)
//The following code will output position and texture coordinates
//-----------------------------
m4x4 oPos, v0, c4 //vertex clip position
mov oT0.xy, v2.xy //Texture coordinates for color texture
m4x4 r8, v0, c0 //Transform vertex into world position
//The following code generates tangent space base vectors
//-----------------------------
m3x3 r11.xyz, v1, c0 //transform normal N to world space
mov r11.w, v1.w
m3x3 r9.xyz, v3, c0 //transform tangent T to world space
mov r9.w, v3.w
crs r10.xyz, r9, r11 //The cross product to compute binormal NxT
//Computes light, eye, and half vectors
//-----------------------------
add r0, c8, -r8 //Build the light vector
nrm r6, r0 //normalize vector
m3x3 oT1.xyz, r6, r9 //transform vector into tangent space
add r0, c9, -r8 //Build the eye vector
nrm r7, r0 //normalize vector
m3x3 oT2.xyz, r7, r9 //transform vector into tangent space
add r0, r6, r7 //build the half vector between light and eye
nrm r1, r0 //normalize vector
m3x3 oT3.xyz, r1, r9 //transform vector into tangent space
All of the lighting is done with the pixel shader. The entire shader is just a computation of previously specified equations. Everything can be seen from the comments, and therefore no further explanation is needed. The following is the pixel shader for Cook-Torrance lighting.
ps_2_0
// Constant registers
//-----------------------------
def c0, 2.71828182845904523536028747135266f, //e
        3.14159265358979323846264338327950f, //pi
        4.0f, //useful constants
        1.0f //useful constants
def c1, 5.0f, 2.0f, 1.0f, 0.0f //useful constants
// c2 - roughness (should be redefined in material)
// c3 - refraction index (should be redefined in material)
// Used input registers
//-----------------------------
dcl t0.xy //texture coordinates
dcl t1.xyz //light vector
dcl t2.xyz //eye vector
dcl t3.xyz //half vector between light and eye
// Used input texture samplers
//-----------------------------
dcl_2d s0 //diffuse texture (gloss in alpha)
dcl_2d s1 //normal texture
// Output
//-----------------------------
// oC0 - output color
// Load and normalize input vectors
//-----------------------------
texld r0, t0, s1 //load normal
mad r1, r0, c1.g, -c1.b //bias normal to range -1,1
nrm r11, r1 //r11 = normalized normal
mov r1.xyz, t1
nrm r10, r1 //r10 = normalized light vector
mov r1.xyz, t2
nrm r9, r1 //r9 = normalized eye vector
mov r1.xyz, t3
nrm r8, r1 //r8 = normalized half vector
// Compute Beckman's distribution function
// D = (1 / m^2*cos(A)^4) * e^(-tan(A)^2 / m^2)
//-----------------------------
dp3 r1, r11, r8 //n.h
mul r1, r1.r, r1.r //x = (n.h)^2
mul r2, c2.r, c2.r //y = m^2
mul r3, r2.r, r1.r //z = m^2 * (n.h)^2
sub r4, c0.a, r1.r //1-x
rcp r5, r3.r //1 / z
mul r4, r4.r, r5.r //(1-x) / z
pow r5, c0.r, -r4.r //pow(e, -(1-x) / z)
mul r1, r3.r, r1.r //z*x
rcp r1, r1.r //1/(z*x)
mul r1, r1.r, r5.r //r1 will hold final D
// Compute Fresnel term (Schlick's approximation)
// F = IR + (1-IR)*(1 - (n.v))^5
//-----------------------------
dp3 r3, r11, r9 //n.v
sub r3, c0.a, r3.r //1 - n.v
pow r3, r3.r, c1.r //(1 - n.v)^5
lrp r2, c3.r, c3.g, r3.r //r2 will hold final F
// Compute self shadowing term
// G = min(1, X*(n.l), X*(n.v)); X = 2*(n.h) / (v.h)
//-----------------------------
dp3 r3, r11, r8 //n.h
dp3 r4, r9, r8 //v.h
mul r3, r3.r, c1.g //2.(n.h)
rcp r5, r4.r //1 / (v.h)
mul r3, r3.r, r5.r //X = 2.(n.h) / (v.h)
dp3 r4, r11, r10 //n.l
dp3 r5, r11, r9 //n.v
mul r4, r4.r, r3.r //second parameter of G : X*(n.l)
mul r5, r5.r, r3.r //third parameter of G : X*(n.v)
min r4, r4.r, r5.r //min of second and third parameters
min r3, r4.r, c0.a //min of previous and 1. We have final G
// Compute denominator part of lighting equation - 1 / (n.v)*pi
//-----------------------------
dp3 r5, r11, r9 //n.v
mul r5, r5.r, c0.g //(n.v)*pi
rcp r4, r5.r //r4 = 1 / (n.v)*pi
// Compute final Cook-Torrance specular term - (1 / (n.v)*pi) * D*F*G
//-----------------------------
mul r5, r1.r, r2.r //D.F
mul r5, r5.r, r3.r //D.F.G
mul r5, r5.r, r4.r //D.F.G / ((n.v)*pi)
// Load texture, compute diffuse part, and combine it all to output
//-----------------------------
dp3_sat r1, r11, r10 //n.l
texld r0, t0, s0 //load diffuse and specular texture
mul r2, r0, r0.a //modulate texture with gloss map
mul r1, r0, r1 //compute diffuse texture
mad r0, r2, r5.r, r1 //compute specular + diffuse
mov oC0, r0
Shaders 1.4
Here we attempt to port the Cook-Torrance specular highlight to the Direct3D 8.1 class of hardware. Only the pixel shader is described because the vertex shader is very similar to version 2.0. Due to the very low precision of the pixel shader and only eight possible instructions per phase, we transferred the entire calculation to lookup textures. The first 2D texture stores the Fresnel part of the Cook-Torrance lighting equation divided by $(n \cdot v)$:

$$F(n \cdot v, R(0)) = \frac{R(0) + (1 - R(0))(1 - n \cdot v)^5}{n \cdot v}$$
By multiplying the results from these functions, we get an almost complete C-T specular equation, except for the geometric attenuation term (we ignore it here). One problem is that these functions are greater than 1 for some parameters, so we have to store as much of their range as possible. We do this by storing the range [0, 1] in the red channel, then subtracting one and storing the remainder (range [1, 2]) in the green channel. We do something similar for the blue and alpha channels. In the pixel shader, we are then able to restore the function to the range [0, 4].
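The packing scheme can be sketched in Python as follows. This is an illustration with hypothetical function names; quantization to 8 bits per channel, which a real texture would add, is left out.

```python
def pack(value):
    """Split a value from [0, 4] into four channels that each hold [0, 1].

    Red stores the part in [0, 1], green the part in [1, 2], and so on.
    """
    clamped = min(max(value, 0.0), 4.0)
    return tuple(min(max(clamped - i, 0.0), 1.0) for i in range(4))


def unpack(rgba):
    """Restore the value by summing the channels, as the pixel shader does."""
    return sum(rgba)


packed = pack(2.5)
restored = unpack(packed)
overflow = unpack(pack(7.0))  # values above 4 are clamped
```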
Figure 14: The image on the left is a visualization of the Fresnel texture; on the right is the Beckman texture.
In the first phase of the shader, we prepare the coordinates for the lookup textures. Additionally, we compute the diffuse part of the shading because we do not have enough instructions left in the second phase. In the second phase, we sample both functions and unpack the intensity from the color channels by summing them into the red component. After that, we compute the final color value. A more clever pack/unpack is also possible: the red channel holds the range [0, 1] and the green holds [1, 2], while the blue channel stores the range [2, 4] and the alpha channel the range [4, 6]. With the _x2 register modifier, we can multiply a register value by two without increasing the instruction count, and the range will then be [0, 6]. The following is pixel shader 1.4 for Cook-Torrance lighting.
ps_1_4
// Constant registers
//-----------------------------
// c2 - vector in form (roughness, 1.0, 1.0, 1.0)
// c3 - vector in form (refraction_index, 1.0, 1.0, 1.0)
// Used input registers
//-----------------------------
// t0 - texture and normal coordinates
// t1 - light vector
// t2 - eye vector
// t3 - half vector
// Used input texture stages
//-----------------------------
// stage0 - diffuse texture
// stage1 - normal texture
// stage2 - f(n_dot_h,roughness) = Beckman(n_dot_h,roughness)/pi
// stage3 - f(n_dot_v,RI) = Fresnel(n_dot_v, RI)/n_dot_v
// Output
//-----------------------------
// r0 - output color
// Sample normal texture and load vectors from input
//-----------------------------
texld r0, t0                //load texture (gloss map in alpha)
texld r1, t0                //normal vector (n)
texcrd r2.rgb, t1.xyz       //light vector (l)
texcrd r3.rgb, t2.xyz       //eye vector (v)
texcrd r4.rgb, t3.xyz       //half vector (h)
// Compute lookup texture coordinates
//-----------------------------
dp3 r5.rgb, r1_bx2, r2      //n.l - for diffuse part
dp3_sat r2.rgb, r1_bx2, r4  //n.h - first parameter of Beckman lookup
mov r2.g, c2.r              //roughness (M) - 2nd parameter of lookup
dp3_sat r3.rgb, r1_bx2, r3  //n.v - for Fresnel equation
mov r3.g, c3.r              //index of refraction - 2nd lookup parameter
mul r1.rgb, r0, r5          //diffuse lighting
mul r4.rgb, r0, r0.a        //modulate texture with gloss
// 2nd phase - Sample diffuse texture and lookup in texture functions
//-----------------------------
phase
texld r0, t0                //load texture (gloss map in alpha)
texld r2, r2                //Beckman distribution (B)
texld r3, r3                //Fresnel lookup (F)
// Expand Beckman and Fresnel to range [0...4] from RGB channels
//-----------------------------
add r2.r, r2.r, r2.g        //unpack - R+G
add r2.r, r2.r, r2.b        //R+G+B
add r2.r, r2.r, r2.a        //R+G+B+A
add r3.r, r3.r, r3.g        //unpack - R+G
add r3.r, r3.r, r3.b        //R+G+B
add r3.r, r3.r, r3.a        //R+G+B+A
mul r2.r, r2.r, r3.r        //I(spec) = B * F
mad r0.rgb, r4, r2.r, r1    //spec_color * I(spec) + diffuse
HLSL Version
The following is the HLSL vertex shader for Cook-Torrance lighting.
// Used input structure
//-----------------------------
struct VS_INPUT
{
    float4 vPosition : POSITION; //position in object space
    float3 vNormal   : NORMAL;   //normal
    float2 tcCoord   : TEXCOORD; //texture coordinates
    float3 vTangent  : TANGENT;  //tangent
};
// Used output structure
//-----------------------------
struct VS_OUTPUT
{
    float4 vClipPos : POSITION;  //Clipping space position
    float2 tcCoord  : TEXCOORD0; //texture coordinates
    float3 vLight   : TEXCOORD1; //light vector
    float3 vEye     : TEXCOORD2; //eye vector
    float3 vHalf    : TEXCOORD3; //half vector
};
// Constant registers
//-----------------------------
float4x4 mToWorld : register(c0); //world space transposed
float4x4 mToClip  : register(c4); //world * view * proj
float4   pLight   : register(c8); //Light position (world space)
float4   pEye     : register(c9); //Eye position (world space)

// function : main
// description : vertex shader function
// return : VS_OUTPUT
// param:
//   VS_INPUT input : vertex shader input
//-----------------------------
VS_OUTPUT main(const VS_INPUT input)
{
    VS_OUTPUT output;
    //The following code outputs position and texture coordinates
    //-----------------------------
    output.vClipPos = mul(input.vPosition, mToClip);
    output.tcCoord = input.tcCoord;
    float4 pVertexWorld = mul(input.vPosition, mToWorld);
    //The following code generates tangent space base vectors
    //-----------------------------
    float3x3 mToTangent;
    mToTangent[0] = mul(input.vTangent, (float3x3)mToWorld);
    mToTangent[2] = mul(input.vNormal, (float3x3)mToWorld);
    mToTangent[1] = cross(mToTangent[0], mToTangent[2]);
    //Compute light, eye, and half vectors
    //-----------------------------
    float3 vToLight = normalize(pLight - pVertexWorld);
    vToLight = mul(mToTangent, vToLight);
    output.vLight = vToLight;
    float3 vToEye = normalize(pEye - pVertexWorld);
    vToEye = mul(mToTangent, vToEye);
    output.vEye = vToEye;
    output.vHalf = normalize(vToLight + vToEye);
    return output;
}
PS_OUTPUT main(const PS_INPUT input,
               uniform float roughness : register(c2),
               uniform float refindex  : register(c3))
{
    PS_OUTPUT output;
    // Load and normalize input vectors
    //-----------------------------
    float3 vNormal = tex2D(smplNormal, input.tcCoord).xyz;
    vNormal = normalize(2.0f * vNormal - 1.0f);
    float3 vLight = normalize(input.vLight);
    float3 vEye = normalize(input.vEye);
    float3 vHalf = normalize(input.vHalf);
    // Beckman's distribution function D
    //-----------------------------
    float normalDotHalf = dot(vNormal, vHalf);
    float normalDotHalf2 = normalDotHalf * normalDotHalf;
    float roughness2 = roughness * roughness;
    float exponent = -(1-normalDotHalf2) / (normalDotHalf2*roughness2);
    float e = 2.71828182845904523536028747135f;
    float D = pow(e,exponent) / (roughness2*normalDotHalf2*normalDotHalf2);
    // Compute Fresnel term F
    //-----------------------------
    float normalDotEye = dot(vNormal, vEye);
    float F = lerp(pow(1 - normalDotEye, 5), 1, refindex);
    // Compute self shadowing term G
    //-----------------------------
    float normalDotLight = dot(vNormal, vLight);
    float X = 2.0f * normalDotHalf / dot(vEye, vHalf);
    float G = min(1, min(X * normalDotLight, X * normalDotEye));
    // Compute final Cook-Torrance specular term
    //-----------------------------
    float pi = 3.1415926535897932384626433832f;
    float CookTorrance = (D*F*G) / (normalDotEye * pi);
    // Load texture, compute diffuse part, and combine it all to output
    //-----------------------------
    float4 color = tex2D(smplTexture, input.tcCoord);
    float4 specular = color * max(0.0f, CookTorrance) * color.a;
    float4 diffuse = color * max(0.0f, normalDotLight);
    output.vColor = diffuse + specular;
    return output;
}
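To check shader output against known values, the specular term of the pixel shader can be mirrored on the CPU. The following is a minimal C++ sketch that follows the D, F, and G computations of the HLSL listing; the Vec3 type and the function cookTorranceSpecular are hypothetical names, and all input vectors are assumed to be normalized.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// CPU mirror of the specular part of the pixel shader (a sketch only).
// n, l, v, h: normal, light, eye, and half vectors, assumed normalized.
float cookTorranceSpecular(const Vec3& n, const Vec3& l, const Vec3& v,
                           const Vec3& h, float roughness, float refindex) {
    const float pi = 3.14159265f;
    const float nh = dot(n, h), nv = dot(n, v);
    const float nl = dot(n, l), vh = dot(v, h);
    assert(nh > 0.0f && nv > 0.0f && vh > 0.0f); // avoid division by zero

    // Beckman distribution D
    const float nh2 = nh * nh;
    const float m2 = roughness * roughness;
    const float D = std::exp(-(1.0f - nh2) / (nh2 * m2)) / (m2 * nh2 * nh2);

    // Fresnel term F, equivalent to lerp(pow(1 - nv, 5), 1, refindex)
    const float base = std::pow(1.0f - nv, 5.0f);
    const float F = base + (1.0f - base) * refindex;

    // Geometric attenuation (self shadowing) term G
    const float X = 2.0f * nh / vh;
    const float G = std::min(1.0f, std::min(X * nl, X * nv));

    return std::max(0.0f, (D * F * G) / (nv * pi));
}
```

When all four vectors coincide with the normal, with a roughness of 0.5 and refindex of 0.5, the terms evaluate to D = 4, F = 0.5, and G = 1, so the function returns 2/pi.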
Quality Comparison
From the following figure it is clear that the older version cannot compete with the newer one in terms of quality. On the other hand, with low roughness and a high refraction index, results are much better than with the Phong shading version 1.4. Also note that the lack of the geometric attenuation factor in the pixel shader 1.4 version causes errors in the intensity computation: for high roughness, areas that are supposed to be dark are brighter. This can be seen below.
Figure 15: Rendering with various refraction index values with pixel shader 1.4 (top row) and pixel shader 2.0 (bottom row). Roughness is constant at 0.15. The index of refraction is 0.15, 0.45, and 0.85 (left to right). Note the visibility of the face edges and error (crack) in the middle of the large highlights in the 1.4 version. (See Color Plate 1.)
Figure 16: Rendering with various roughness values with pixel shader 1.4 (top row) and pixel shader 2.0 (bottom row). The refraction index is constant at 1.0 (full reflection). Roughness is 0.1, 0.2, 0.4, 0.6, 0.8, and 1.0 (left to right).
Conclusion
Previous examples showed that Direct3D 9 shaders are a great step toward more reality in games. What was not possible to put into a shader before (or was possible only with hacks and compromises) can now be done in full quality and in a single pass. The main advantage is floating-point precision in pixel shaders. Vertex shaders 2.0 are great for games. With a huge instruction count and flow control, we can use one or two of them for a whole scene with features like per-vertex lighting and mesh skinning available via functions. Due to per-vertex lighting, there is no longer a need to switch the shader and do multipass rendering. Pixel shader 2.0 architecture makes it possible to perform advanced lighting models per pixel. Sixty-four arithmetic and 32 texture instructions should be enough for most games to keep the number of required passes to the minimum. With shadows added into the scene, one light will still require at least one pass to render (the most common is two). The available instruction count can be used to render shadows for one light and some other per-pixel lights without shadows in one pass. Shaders 3.0 are great in functionality, but to me the pixel shader is a bit unfinished because it lacks relative addressing of
constant registers. With this enabled, an arbitrary number of lights could be computed per pixel (in world space, without shadow) in one pass thanks to loops (this can be done now with loop unrolling at design time, but it is not very handy). Also, a clear border could be defined between pixel and vertex shaders: vertex shaders for geometry transformation and pixel shaders for visualization. Today, with flow control and a huge instruction count, there can be one pixel shader for the whole scene with all required features available when needed. Support for the shaders 2.x model in games is questionable. With capabilities changing from card to card, the game can use the shaders' full potential only with some sort of run-time linker that merges fragments of prepared code according to the capabilities of the current device.
References
[1] Valient, M., Project6 - Lighting, November 2002, http://www.dimension3.host.sk.
[2] Möller, T. and E. Haines, Real-Time Rendering, second edition, A.K. Peters, Natick, Massachusetts, 2002, http://www.realtimerendering.com.
[3] Taylor, Philip, Per-Pixel Lighting, MSDN article, November 2001, http://msdn.microsoft.com/directx.
[4] Kilgard, Mark J., A Practical and Robust Bump-mapping Technique for Today's GPUs, nVidia Corporation paper, March 2000, http://developer.nvidia.com.
[5] Wloka, Matthias, Fresnel Reflection Technical Report, nVidia Corporation paper, June 2002, http://developer.nvidia.com.
[6] Torrance, K. and E. Sparrow, Theory for off-specular reflection from roughened surfaces, Journal of the Optical Society of America, September 1967, 57:1105-1114.
[7] Oren, Michael and Shree K. Nayar, Generalization of Lambert's Reflectance Model, Computer Graphics (SIGGRAPH 94 Proceedings) 1994, pp. 239-246, http://www.cs.columbia.edu/~oren/.
[8] Oren, Michael and Shree K. Nayar, Generalization of the Lambertian Model and Implications for Machine Vision, International Journal of Computer Vision, Vol. 14:3.
[9] Cook, Robert L. and Kenneth E. Torrance, A reflectance model for computer graphics, Computer Graphics (SIGGRAPH 81 Proceedings) July 1981, Vol. 15:3, pp. 307-316.
Introduction
Environmental fog is an effect that has been used in numerous games. It is popular because adding an additional layer of ambience is a great way to increase the realism of a scene. The physical phenomenon itself is caused by dust and other particles suspended in the atmosphere, which scatter incoming light and emit it back into the scene. Fog effects have become essential for depth culling, especially for large outdoor scenes, as they reduce the amount of popping caused by distant objects entering the scene at the far clipping plane. Fog is a relatively cheap calculation, well supported by today's graphics hardware. It has become more and more a vital part of the gameplay. It allows creating different areas of visibility that can be easily incorporated into line-of-sight and hiding strategies. In this article we start by taking a look at the standard fog effects: linear, exponential, and exponential squared fog. We create vertex and pixel shader equivalents for the fog calculations of the fixed-function pipeline (FFP). Although the shader versions are quite simple, they may help you develop your own variations and modifications of fog, which could result in the creation of some more interesting effects.
After dealing with these basic types, we take a look at layered fog. Layered fog is used to simulate ground fog and effects like smog, smoke, or clouds. Finally we discuss a technique for real-time animated fog, which can be used to create non-uniform fog density over scenes and time. All examples are based on DirectX but are applicable to other APIs as well.
Imagine a ray of light starting at a point in the scene and traveling toward the virtual camera. In a scene where no fog is present, the intensity of light along this ray is determined by the light intensity at the starting point. In a foggy scene, the light intensity that reaches the virtual camera is influenced by absorption (light that is blocked by particles along the ray), out-scattering (light that is reflected toward other directions), and in-scattering (lights from other directions that are reflected along the ray). This influence is represented by the density factor g in our calculations. Rays starting at points farther away from the camera have a far higher chance of getting influenced by fog particles than rays starting closer to the camera. So we have to take the distance d to the camera into account.
The final color, received at the camera and also written to the frame buffer, is determined by blending between the color of a scene point (Ccurrent) and the color of the fog (Cfog). The blend factor (f) has to be in the range [0.0, 1.0], meaning that a blend factor of 1.0 results in no fog influence (e.g., for points near the camera), and a factor of 0.0 results in full fog influence (e.g., for points far in the distance). Mathematically, this can be expressed by:
$C_{final} = f \cdot C_{current} + (1 - f) \cdot C_{fog}$
...where Cfinal represents the final color written to the frame buffer. Depending on the equation used to determine the factor f, various results can be achieved, both visually and with respect to performance. The same applies to determining the fog color Cfog. It can be constant for all rendered points or determined by additional calculations. There are two commonly used approaches for determining the distance factor (d): plane-based and range-based.
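The blend equation can be tried out on the CPU before moving it into a shader. The following is a minimal C++ sketch; the Color struct and the function blendFog are hypothetical names chosen for illustration.

```cpp
#include <cassert>

struct Color { float r, g, b; };

// C_final = f * C_current + (1 - f) * C_fog
// f = 1 means no fog influence, f = 0 means full fog influence.
Color blendFog(const Color& current, const Color& fog, float f) {
    assert(f >= 0.0f && f <= 1.0f); // the blend factor must lie in [0, 1]
    return Color{
        f * current.r + (1.0f - f) * fog.r,
        f * current.g + (1.0f - f) * fog.g,
        f * current.b + (1.0f - f) * fog.b
    };
}
```

Blending a pure red scene color with a mid-gray fog at f = 0.5 yields (0.75, 0.25, 0.25).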
- Plane-based: The camera space z-depth of a vertex/pixel is used as the distance factor. It is very cheap.
- Range-based: The exact distance between a vertex/pixel and the virtual camera is calculated. This can be done in either world space or camera space. It is more expensive, since this usually requires the evaluation of a square root.
Both ways of choosing d are used in the techniques described in the following sections.
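The difference between the two approaches is easy to quantify. The following is a minimal C++ sketch, assuming a camera space position with the camera at the origin looking down the z axis; Vec3 and the three function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Plane-based: only the camera space z-depth is used.
float planeDistance(const Vec3& camSpacePos) {
    return camSpacePos.z;
}

// Range-based: exact Euclidean distance to the camera at the origin.
float rangeDistance(const Vec3& p) {
    return std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
}

// Fraction by which the plane-based value underestimates the true range.
float planeBasedError(const Vec3& p) {
    const float range = rangeDistance(p);
    assert(range > 0.0f); // undefined at the camera position itself
    return 1.0f - planeDistance(p) / range;
}
```

For a point at (3, 0, 4) in camera space, the plane-based value is 4 while the range-based value is 5, so the plane-based approach underestimates the distance by 20 percent. Directly in front of the camera both values agree.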
Fog Equation
$f = \frac{Z_{FogEnd} - Z_{Depth}}{Z_{FogEnd} - Z_{FogStart}}$
Linear fog is the simplest calculation of all discussed methods. It assumes an equal distribution of fog particles throughout the scene. Generally, you specify a starting value (ZFogStart) and an ending value (ZFogEnd), which control the way fog is applied to the scene. For vertices/pixels that are between these boundaries, a blend value is obtained by a linear interpolation. The amount of fog present in the scene constantly increases with the distance to the virtual camera.
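As a quick check of the formula, the following minimal C++ sketch evaluates the linear fog factor on the CPU. The function name linearFogFactor is hypothetical; the clamp to [0, 1] matches the blend factor range used throughout this article.

```cpp
#include <cassert>

// f = (fogEnd - depth) / (fogEnd - fogStart), clamped to [0, 1].
float linearFogFactor(float depth, float fogStart, float fogEnd) {
    assert(fogEnd > fogStart); // the fog range must not be empty
    float f = (fogEnd - depth) / (fogEnd - fogStart);
    if (f < 0.0f) f = 0.0f;
    if (f > 1.0f) f = 1.0f;
    return f;
}
```

With a fog start of 10 and a fog end of 50, a depth of 30 lies halfway through the fog range and gives f = 0.5; depths below 10 give 1.0 (no fog), and depths of 50 or more give 0.0 (full fog).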
Implementation
Shader
For the implementation of linear fog, a vertex shader is used. The shader calculates the fog value according to the formula mentioned above and puts the result into the FOG register of the vertex shader ALU. This register is used by the FFP in the fog blending that takes place after the pixel processing of the graphics pipeline. The complete vertex shader code follows:
VS_OUTPUT main(const VS_INPUT Input)
{
    float4 clpPos, camPos;
    // Init output
    VS_OUTPUT Out = (VS_OUTPUT) 0;
    // Retrieve fog parameters
    float fFogEnd = fFog.x;
    float fFogStart = fFog.y;
    // Calculate the clip space position
    Out.Position = mul(Input.Position, matWorldViewProj);
    // Simply pass on the texture coords and the vertex color
    Out.Tex0.xy = Input.Tex0.xy;
    Out.Diffuse = Input.Diffuse;
    // Calculate vertex position in camera space
    camPos = mul(Input.Position, matWorldView);
    // Calculate the linear fog factor
    float fFogRange = fFogEnd-fFogStart;
    float fVertexDist = fFogEnd - camPos.z;
    float f = clamp((fVertexDist/fFogRange), 0.0f, 1.0f);
    // Write the calculated factor to the FOG register
    Out.Fog = f;
    return Out;
}
Let's have a look at the code parts that are relevant for fog calculations. The application supplies the shader with the settings for ZFogStart and ZFogEnd by filling the first two components of the shader constant fFog. For better readability, these values are extracted and placed in variables.
float fFogEnd = fFog.x;
float fFogStart = fFog.y;
The linear fog interpolation is done by dividing the difference between the far fog end and the camera space z coordinate of a vertex (ZFogEnd - ZDepth) by the range over which fog is applied (ZFogEnd - ZFogStart).
float fFogRange = fFogEnd-fFogStart;
float fVertexDist = fFogEnd - camPos.z;
float f = clamp((fVertexDist/fFogRange), 0.0f, 1.0f);
This calculation produces f values of 1.0 for vertices that are closer to the camera than ZFogStart and values of 0.0 for vertices with distance ZFogEnd or greater. The result is clamped to the range [0.0, 1.0] to stick to the conventions of the FOG register usage of the FFP. f is used in the fog blending phase of the FFP to mix between fragment color and fog color. As you might have seen, the calculation above uses the plane-based approach in determining the distance by simply evaluating the camera space z coordinate of a scene point. Normally, this works quite well, but it depends on the specific settings of an application. For a very obtuse field of view, this plane-based calculation may produce some visual anomalies when using a rotating camera. In this case, the problem is that distant objects may enter and leave the fog range during minor camera movements. We see the more accurate distance calculation later on when we discuss exponential squared fog.
Application Settings
There is indeed not much to say about the application that drives this shader besides the fact that the fog calculation of the FFP has to be enabled prior to rendering the scene. This is necessary because we want to make the FFP use the value calculated by our vertex shader.
m_pd3dDevice->SetRenderState(D3DRS_FOGCOLOR, D3DCOLOR_XRGB(100,100,100));
m_pd3dDevice->SetRenderState(D3DRS_FOGENABLE, TRUE);
...
m_pd3dDevice->DrawPrimitive(D3DPT_TRIANGLELIST, 0, m_nPolyCount);
...
m_pd3dDevice->SetRenderState(D3DRS_FOGENABLE, FALSE);
In addition to these settings, the combined world-view-projection matrix as well as a vector with fog parameters is provided to the shader.
m_pd3dDevice->SetVertexShaderConstantF(0, (float*)&m_matWorldViewProj, 4);
m_pd3dDevice->SetVertexShaderConstantF(10, (float*)&fFog, 1);
Fog Equation
$f = \frac{1}{e^{(d \cdot g)}} = e^{-(d \cdot g)} = e^{-(distance \cdot density)}$
Exponential fog calculates the fog factor f based on the formula above. As you can see in Figure 3, the effect is a more rapid decrease in density than seen for linear fog.
When we recall the fact that fog is caused by a reduction of light intensity along a ray from a point in the scene toward the virtual camera, we can see that exponential fog very much corresponds to the real-world phenomenon. In this simplified model, we assume a constant fog density over the scene. Think of light from a scene point traveling along a ray in multiple steps, where each step has a predefined unit length. By the time the light has crossed the first unit along this ray, its intensity is reduced by a constant factor (determined by the chosen density). The next unit starts with a reduced light intensity, which again is reduced by the same constant reduction factor applied to the previous ray. When choosing very small segments, we end up with an exponential behavior that is well simulated by the use of the e-function.
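The step-by-step argument can be reproduced numerically. The following is a minimal C++ sketch, assuming that each segment removes the fraction density * segmentLength of the remaining intensity; the function name steppedTransmittance is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Attenuate a unit intensity over `steps` equal segments of a ray of
// length `distance`. Each segment scales the remaining intensity by
// the same constant factor (1 - density * segmentLength).
double steppedTransmittance(double distance, double density, int steps) {
    assert(steps > 0);
    const double segment = distance / steps;
    double intensity = 1.0;
    for (int i = 0; i < steps; ++i) {
        intensity *= (1.0 - density * segment);
    }
    return intensity;
}
```

For a distance of 2 and a density of 0.5, ten steps give about 0.349, a thousand steps give about 0.3677, and the limit for ever smaller segments is exp(-1), roughly 0.3679, which is exactly the value of the exponential fog equation.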
Implementation
This example no longer relies on the fog blending of the FFP but uses a vertex shader to compute the fog factor and a pixel shader to do the blending of fragment color with fog color. As with the linear fog implementation, we use the camera space z-depth as a distance factor. In contrast, the DirectX FFP determines the distance factor by using either the Z or the W value of a vertex. For more information on this implementation, please refer to [Rogers].
Shader
The complete vertex shader code follows:
VS_OUTPUT main(const VS_INPUT Input)
{
    float4 clpPos, camPos;
    // Init output
    VS_OUTPUT Out = (VS_OUTPUT) 0;
    // Calculate the clip space position
    clpPos = mul(Input.Position, matWorldViewProj);
    Out.Position = clpPos;
    // Simply pass on the texture coords and the vertex color
    Out.Tex0.xy = Input.Tex0.xy;
    Out.Diffuse = Input.Diffuse;
    // Calculate vertex position in camera space
    camPos = mul(Input.Position, matWorldView);
    // Extract the fog parameters
    float fDensity = fFog.x;
    float fFogEnd = fFog.y;
    // Calculate the distance.
    // Camera space z coords scaled to have a value of 4 at distance: FogEnd
    float fDist = camPos.z/fFogEnd*4;
    // Exp calculation
    float f = exp(-fDist*fDensity);
    // Set the fog value
    Out.FogVal.x = f; // Passed to pixel shader using color register
    return Out;
}
The parameters of the shader are density g and fog end distance ZFogEnd. A fog starting distance is not needed here, since fog blending starts immediately at the virtual camera. Both parameters are provided by the application in constant registers and assigned to some temporary variables using the following instructions:
float fDensity = fFog.x;
float fFogEnd = fFog.y;
When you take a look at Figure 3, you can see that the function quickly decreases to 0 with increasing input values. For input values around 4.0, the function is already close to 0. We can use this fact to scale the camera space z-depth in a way that results in a value of 4.0 at distance ZFogEnd.
float fDist = camPos.z/fFogEnd*4;
To calculate the value of an exponential function, the exp() library function is used on the product of scaled distance and density. The way in which the distance scaling is chosen ensures a fog factor of nearly 0.0 for vertices with distance ZFogEnd and of nearly 1.0 for vertices near the camera.
float f = exp(-fDist*fDensity);
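The effect of the distance scaling can be verified with a few sample values. The following is a minimal C++ sketch of the same calculation on the CPU; the function name expFogFactor is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Exponential fog factor with the depth scaled so that it reaches a
// value of 4 at distance fogEnd, as done in the vertex shader.
float expFogFactor(float depth, float fogEnd, float density) {
    assert(fogEnd > 0.0f);
    const float scaledDist = depth / fogEnd * 4.0f;
    return std::exp(-scaledDist * density);
}
```

With a density of 1.0, a vertex at the camera gets a factor of 1.0 (no fog), while a vertex at distance ZFogEnd gets exp(-4), roughly 0.018, which is close to full fog.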
The last thing we have to do is store the result in an output register so that it becomes accessible to a pixel shader.
Out.FogVal.x = f; // Passed to PixelShader using color register
The variable Out.FogVal is bound to the COLOR1 register of the vertex processing ALU, which is not used by any color value (e.g., specular) in this shader.
The pixel shader used in this example is very simple. It looks up the base texture of the fragment and does the blending, according to the fog value provided by the previous vertex program. The complete fragment program follows:
PS_OUTPUT main(const PS_INPUT Input, uniform sampler2D baseTexture)
{
    // Init output
    PS_OUTPUT Out = (PS_OUTPUT) 0;
    // Retrieve base texture
    float4 colorBase = tex2D(baseTexture, Input.Tex0);
    // Fog blending
    float f = Input.FogVal.x;
    Out.Color = lerp(colorFog, colorBase*Input.Diffuse.xyzz, f);
    return Out;
}
The last two statements are the only ones of interest for our fog examination:
float f   = Input.FogVal.x;
Out.Color = lerp(colorFog, colorBase*Input.Diffuse.xyzz, f);
First, we extract the interpolated fog value from the COLOR1 register, which was prepared by the vertex program. Then a lerp is performed between the fog color and the base color. Since we are no longer using the FFP for fog blending, the fog color is provided as constant input to the shader. It is no longer necessary to set the fog color as a renderstate. To enhance the visual output, the base color, which has been looked up from a texture, is modulated by the diffuse color of a fragment before the final blending takes place.
Application Settings
As we are not using the FFP for blending fog and fragment color, this part of the pipeline can be disabled for the whole application.
m_pd3dDevice->SetRenderState(D3DRS_FOGENABLE,FALSE);
The vertex shader is provided with the combined world-view-projection matrix and a vector containing fog parameters. The pixel shader, on the other hand, needs the fog color that is used during blending.
m_pd3dDevice->SetVertexShaderConstantF(0, (float*)&m_matWorldViewProj, 4);
m_pd3dDevice->SetVertexShaderConstantF(10, (float*)&fFog, 1);
m_pd3dDevice->SetPixelShaderConstantF(5, (float*)&colFog, 1);
Fog Equation
$f = \frac{1}{e^{(d \cdot g)^2}} = e^{-(d \cdot g)^2} = e^{-(distance \cdot density)^2}$
Exponential squared fog is similar to the exponential fog that we discussed above. However, when looking at Figure 5, you can see that, compared with the exponential function, the squared function has a flatter slope for near distances followed by a steeper slope later on. Looking at an application, this leads to a wider range of totally clear visibility surrounding the virtual camera, followed by an intense density increase toward ZFogEnd.
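The different slopes of the two functions can be compared directly. The following is a minimal C++ sketch that evaluates both fog factors for the same distance and density; the function names expFog and exp2Fog are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Exponential fog: f = e^-(d*g)
double expFog(double d, double g) {
    assert(d >= 0.0 && g >= 0.0); // distance and density are non-negative
    return std::exp(-(d * g));
}

// Exponential squared fog: f = e^-((d*g)^2)
double exp2Fog(double d, double g) {
    assert(d >= 0.0 && g >= 0.0);
    const double x = d * g;
    return std::exp(-(x * x));
}
```

Both functions return the same value where d*g equals 1. Below that point the squared version stays closer to 1.0 (clearer visibility near the camera, for example about 0.78 versus 0.61 at d*g = 0.5), and beyond it the squared version drops faster (about 0.018 versus 0.135 at d*g = 2).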
On current generation graphics hardware, the slightly more complex exponent calculation should not produce a big performance hit, but you may want to keep this in mind when writing shaders for older boards, where a difference may be noticeable.
Implementation
Of course, the implementation is quite similar to the exponential fog version, but there are two main differences to point out. In the first place, we do not use the camera space z-depth of a vertex as the distance factor. Secondly, the factor calculation is changed to include the additional multiplication. Using the camera space z-depth as the distance factor may produce some artifacts because all vertices that lie on the same z-plane are assigned the same distance factor. This is only an approximation of the real distance. Taking the z-depth as distance is only right for vertices directly in front of the camera. As vertices move toward the outer screen regions, the error of this computation increases more and more. Depending on your application settings, this may not be noticeable, but for extremely wide fields of view and areas with a fast increasing slope of fog density (especially for exponential squared fog), this could cause some problems. To circumvent this behavior, the exact distance of every vertex to the virtual camera should be calculated. This is what we will do in the implementation of this section's shader.
Shader
The complete vertex shader code follows:
VS_OUTPUT main(const VS_INPUT Input)
{
    float4 clpPos, worldPos;
    // Init output
    VS_OUTPUT Out = (VS_OUTPUT) 0;
    // Calculate the clip space position
    clpPos = mul(Input.Position, matWorldViewProj);
    Out.Position = clpPos;
    // Simply pass on the texture coords and the vertex color
    Out.Tex0.xy = Input.Tex0.xy;
    Out.Diffuse = Input.Diffuse;
    // Extract the fog parameters
    float fDensity = fFog.x;
    float fFogEnd = fFog.y;
    // Calculate the vertex position in world space
    worldPos = mul(Input.Position, matWorld);
    // Calculate the distance to the viewer in world space
    float fDistance = distance(worldPos, vCamera);
    // The distance is scaled to have a value of 4 at distance: FogEnd
    float fDist = fDistance/fFogEnd*4;
    // Exp squared calculation
    float fFog = exp(-(fDist*fDensity)*(fDist*fDensity));
    // Set the fog value
    Out.FogVal.x = fFog;
    return Out;
}
This time the distance factor is computed by calculating the exact distance between a vertex and the camera in world space. To get the world space position of a vertex, it is multiplied by the application's world matrix, which is supplied in some of the constant registers. This is similar to the transformation already performed to bring the vertex into clip space, but the difference is that the world matrix is used instead of the combined world-view-projection matrix.
worldPos = mul(Input.Position, matWorld);
Now, as we have computed the world space vertex position, we simply use the standard library function distance() to determine the exact distance between a vertex location and the camera. The coordinates of the virtual camera are supplied to the shader within the register bound to the vCamera variable.
float fDistance = distance(worldPos, vCamera);
float fDist     = fDistance/fFogEnd*4;
As seen in the last section, the distance is scaled to reach a value of 4.0 at the distance ZFogEnd.
The fragment program is identical to the one used in the last section, so please refer back to the exponential fog implementation for details.
Application Settings
Application settings are also very similar to those used by the exponential fog implementation in the last section. However, because of the changed distance computation, we have to provide the shader with the necessary world matrix and the position of the virtual camera.
m_pd3dDevice->SetVertexShaderConstantF(8, (float*)&m_matWorld, 4);
m_pd3dDevice->SetVertexShaderConstantF(12, (float*)&fFog, 1);
m_pd3dDevice->SetVertexShaderConstantF(13, (float*)&vCamera, 1);
Figure 7: Height and distance relation between virtual camera (a) and scene point (c).
The height difference DY is computed by subtraction of the y coordinates of the two points:
$DY = |Y_C - Y_P|$
The distance DD along the x-z plane is computed by:
$DD = \sqrt{(X_C - X_P)^2 + (Z_C - Z_P)^2}$
What is left for our layered fog model is the computation of the fog density at various scene points. Modeling complex fog usually requires the integration of density along a line from the virtual camera to a point in the scene (e.g., $density(a,c) = \int_a^c density(t)\,dt$, where density(a,c) is the total density used in the computation of the fog factor for the line between the camera (c) and a scene point (a) and where density(t) defines the density at height t). This is necessary because realistic fog normally does not have a constant density distribution. However, to simplify things, our implementation assumes a constant density increase for an increasing height difference between points inside the fog area. This assumption allows us to choose 0.5*DY^2 as the density integral. Now we have gathered all the information needed to define a suitable density function:
$density(a,c) = \sqrt{1 + \left(\frac{DD}{DY}\right)^2} \int_a^c density(t)\,dt = \sqrt{1 + \left(\frac{DD}{DY}\right)^2} \cdot \frac{DY^2}{2}$
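The density function can be evaluated on the CPU to get a feeling for the resulting values. The following is a minimal C++ sketch, assuming that DD and DY have already been scaled by the application; the function names layeredFogDensity and layeredFogFactor are hypothetical, and returning zero for DY = 0 is an assumption that avoids the division by zero.

```cpp
#include <cassert>
#include <cmath>

// density(a,c) = sqrt(1 + (DD/DY)^2) * DY^2 / 2
// dd: distance along the x-z plane, dy: height difference inside the fog.
double layeredFogDensity(double dd, double dy) {
    assert(dd >= 0.0 && dy >= 0.0);
    if (dy == 0.0) {
        return 0.0; // assumption: no height difference means no fog
    }
    const double ratio = dd / dy;
    return std::sqrt(1.0 + ratio * ratio) * 0.5 * dy * dy;
}

// Fog factor for blending, using the exponential function as before.
double layeredFogFactor(double dd, double dy) {
    return std::exp(-layeredFogDensity(dd, dy));
}
```

For DD = 3 and DY = 4, the square root evaluates to 1.25 and the density integral to 8, giving a total density of 10. For a purely vertical ray (DD = 0) with DY = 2, the density is simply the integral value of 2.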
Implementation
With all the theory in place, the implementation is a straightforward translation of the formulas described above into shader code. As with the previous techniques, we use a vertex shader to output the fog value at a vertex and a pixel shader that does the blending with the base texture.
Shader
The complete vertex program follows:
VS_OUTPUT main(const VS_INPUT Input)
{
    float4 clpPos, camPos, worldPos;
    float fDistance;
    // Init output
    VS_OUTPUT Out = (VS_OUTPUT) 0;
    // Calculate the clip space position
    clpPos = mul(Input.Position, matWorldViewProj);
    Out.Position = clpPos;
    // Simply pass on the texture coords and the vertex color
    Out.Tex0.xy = Input.Tex0.xy;
    Out.Diffuse = Input.Diffuse;
    // Get fog parameter
    float fFogTop = fFog.x;
    float fFogEnd = fFog.y;
    float fFogRange = fFog.x;
    // Calculate the world position
    worldPos = mul(Input.Position, matWorld);
    // Calculate the distance to the viewer
    fDistance = distance(worldPos, vCamera);
    // Project both points into the x-z plane
    float4 vCameraProj, vWorldProj;
    vCameraProj = vCamera;
    vCameraProj.y = 0;
    vWorldProj = worldPos;
    vWorldProj.y = 0;
    // Scaled distance calculation in x-z plane
    float fDeltaD = distance(vCameraProj, vWorldProj)/fFogEnd*2.0f;
    // Height-based calculations
    float fDeltaY, fDensityIntegral;
    if (vCamera.y > fFogTop)
    {
        if (worldPos.y < fFogTop)
        {
            fDeltaY = (fFogTop - worldPos.y)/fFogRange*2;
            fDensityIntegral = (fDeltaY * fDeltaY * 0.5f);
        }
        else
        {
            fDeltaY = 0.0f;
            fDensityIntegral = 0.0f;
        }
    }
    else
    {
        if (worldPos.y < fFogTop)
        {
            float fDeltaA = (fFogTop - vCamera.y)/fFogRange*2;
            float fDeltaB = (fFogTop - worldPos.y)/fFogRange*2;
            fDeltaY = abs(fDeltaA - fDeltaB);
            fDensityIntegral = abs((fDeltaA * fDeltaA * 0.5f) -
                                   (fDeltaB * fDeltaB * 0.5f));
        }
        else
        {
            fDeltaY = abs(fFogTop - vCamera.y)/fFogRange*2;
            fDensityIntegral = abs(fDeltaY * fDeltaY * 0.5f);
        }
    }
    float fDensity;
    if (fDeltaY != 0.0f)
    {
        fDensity = (sqrt(1.0f + ((fDeltaD / fDeltaY) * (fDeltaD / fDeltaY)))) *
                   fDensityIntegral;
    }
    else
    {
        fDensity = 0.0f;
    }
    float f = exp(-fDensity);
    // Set the fog value
    Out.FogVal.x = f; // Passed to PixelShader using color register
    return Out;
}
Let's revisit the fog-related parts, step by step. The application supplies the shader with the parameter settings for the top of the fog, which is the height above the ground up to where we want ground mist to last. The second parameter is ZFogEnd, which is used to determine the distance where scene points should be completely encompassed by the fog. We are also using a temporary variable called FogRange, which in our case is the same as the top of the fog because this section's ground fog implementation starts at height zero.
float fFogTop   = fFog.x;
float fFogEnd   = fFog.y;
float fFogRange = fFog.x;
To compute the distance between a point's world space position and the virtual camera in the x-z plane, both points are projected onto this plane by setting their y coordinate to 0.
vCameraProj   = vCamera;
vCameraProj.y = 0;
vWorldProj    = worldPos;
vWorldProj.y  = 0;
Getting the distance between the two points is a matter of calling the library function distance(). As you have seen in the "Theory and Equations" section above, the distance ΔD contributes to the exponent of the exponential function. To get a desirable result, this factor, along with other factors that influence the exponent (e.g., ΔY), has to undergo some application-specific scaling. The scaling itself is influenced by the dimensions of the underlying world coordinate system. In our case, ΔD is scaled to have a value of 2.0 for points at distance ZFogEnd.
float fDeltaD = distance(vCameraProj, vWorldProj)/fFogEnd*2.0f;
Regarding the calculation of ΔY and the density to use, we have to distinguish several cases. Thanks to the recent addition of control flow to shader languages, this is easily done using if-then-else statements. First of all, we have to determine whether the virtual camera is above or beneath the top of the fog. Let's start with the case in which the camera is above the fog top.
Next we need to find out if the vertex that is currently being processed is above or below the fog top. If it is above the fog top, we do not want to apply fog to this vertex, and we simply set ΔY and the density to 0. If, on the other hand, the vertex is below the fog top, we want to apply fog. According to our theory, ΔY is calculated as the difference between the top of the fog and the vertex height. This should give us more fog for vertices that reside closer to the ground and less fog for vertices that have just entered the fog area. As with ΔD, we apply our application-defined scaling to get proper results. Afterward we are ready to determine the density with the formula ΔY²/2.
if (worldPos.y < fFogTop)
{
    fDeltaY = (fFogTop - worldPos.y)/fFogRange*2;
    fDensityIntegral = (fDeltaY * fDeltaY * 0.5f);
}
else
{
    fDeltaY = 0.0f;
    fDensityIntegral = 0.0f;
}
In the second case, the virtual camera is located inside the fog, which means its height is between the ground (height zero) and the top of the fog. For each processed vertex, we again have to determine whether the vertex is above or below the fog top. Being above the fog top means that the camera is looking toward a point in a non-foggy area, so ΔY is computed as the difference between the top of the fog and the camera height. If the processed vertex is underneath the fog top, we have to calculate two densities. The first is computed for the line between the eye and the top of the fog, and the second is computed for the line between the vertex height and the top of the fog. The final value results from the difference between these two densities.
if (worldPos.y < fFogTop)
{
    float fDeltaA = (fFogTop - vCamera.y)/fFogRange*2;
    float fDeltaB = (fFogTop - worldPos.y)/fFogRange*2;
    fDeltaY = abs(fDeltaA - fDeltaB);
    fDensityIntegral = abs((fDeltaA * fDeltaA * 0.5f) - (fDeltaB * fDeltaB * 0.5f));
}
else
{
    fDeltaY = abs(fFogTop - vCamera.y)/fFogRange*2;
    fDensityIntegral = abs(fDeltaY * fDeltaY * 0.5f);
}
All terms that contribute to the density calculation have now been computed. However, the case of ΔY being zero has to be handled separately, as we would otherwise divide by zero. The solution is to set the density to 0 in this case:
if (fDeltaY != 0.0f)
    fDensity = (sqrt(1.0f + ((fDeltaD / fDeltaY) * (fDeltaD / fDeltaY)))) * fDensityIntegral;
else
    fDensity = 0.0f;
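The square-root factor in this expression can be read geometrically. The following is only a sketch of that reading, under two assumptions that are not restated from the theory section: the fog density grows linearly with depth below the fog top (which is what the ΔY²/2 term suggests), and the view ray inside the fog is a straight segment with horizontal extent ΔD and vertical extent ΔY.

```latex
\begin{aligned}
\text{vertical integral:}\quad
  & \int_0^{\Delta Y} t \, dt = \frac{\Delta Y^2}{2} \\
\text{ray length per unit of height:}\quad
  & \frac{\sqrt{\Delta D^2 + \Delta Y^2}}{\Delta Y}
    = \sqrt{1 + \left(\frac{\Delta D}{\Delta Y}\right)^2} \\
\text{density along the ray:}\quad
  & \sqrt{1 + \left(\frac{\Delta D}{\Delta Y}\right)^2} \cdot \frac{\Delta Y^2}{2}
\end{aligned}
```

Under the same reading, the camera-inside-the-fog case integrates between the two scaled depths, which gives |ΔA²/2 - ΔB²/2| with ΔY = |ΔA - ΔB|. Note that the zero guard also yields a density of 0 for a perfectly horizontal ray inside the fog, since ΔY is 0 there.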
With the last two lines, the fog value to be used by the pixel shader is computed using the library function exp() and handed over inside the COLOR1 register.
float f = exp(-fDensity);
Out.FogVal.x = f; // Passed to pixel shader using color register
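When tuning the fog parameters, it can help to evaluate the same math on the CPU for a handful of sample points. The following C++ function is a minimal sketch that mirrors the shader logic walked through above; the names groundFogFactor and Vec3 are hypothetical and are not part of the chapter's sample code.

```cpp
#include <cassert>
#include <cmath>

// Illustrative CPU mirror of the layered-fog shader math (names are hypothetical).
// Heights are world-space y values; the fog layer starts at height zero.
struct Vec3 { float x, y, z; };

float groundFogFactor(Vec3 camera, Vec3 world, float fogTop, float fogEnd) {
    assert(fogTop > 0.0f && fogEnd > 0.0f);
    const float fogRange = fogTop;  // same simplification as in the shader

    // Distance in the x-z plane, scaled to 2.0 at fogEnd.
    const float dx = world.x - camera.x;
    const float dz = world.z - camera.z;
    const float deltaD = std::sqrt(dx * dx + dz * dz) / fogEnd * 2.0f;

    float deltaY = 0.0f;
    float integral = 0.0f;
    if (camera.y > fogTop) {
        // Camera above the fog: only vertices below the fog top receive fog.
        if (world.y < fogTop) {
            deltaY = (fogTop - world.y) / fogRange * 2.0f;
            integral = deltaY * deltaY * 0.5f;
        }
    } else {
        // Camera inside the fog.
        const float deltaA = (fogTop - camera.y) / fogRange * 2.0f;
        if (world.y < fogTop) {
            const float deltaB = (fogTop - world.y) / fogRange * 2.0f;
            deltaY = std::fabs(deltaA - deltaB);
            integral = std::fabs(deltaA * deltaA * 0.5f - deltaB * deltaB * 0.5f);
        } else {
            deltaY = std::fabs(deltaA);
            integral = deltaY * deltaY * 0.5f;
        }
    }

    // Guard against division by zero exactly as the shader does.
    float density = 0.0f;
    if (deltaY != 0.0f) {
        const float ratio = deltaD / deltaY;
        density = std::sqrt(1.0f + ratio * ratio) * integral;
    }
    return std::exp(-density);  // 1.0 means no accumulated density
}
```

For example, with a fog top of 4 and a camera at height 10, a vertex on the ground directly below the camera gets ΔD = 0 and ΔY = 2, so the density is 2 and the returned factor is exp(-2).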
Application Settings
As seen in the other sections, the application provides the shader with the necessary settings for fFogTop, fFogEnd, fFogRange, and CFog in addition to the usual matrices and camera information.
\[ K(x,y,z) = \sum_{i=1}^{n} c_i(x,y,z) \]
...where K(x,y,z) is the sum of all periodic functions applied over the scene and c_i represents a single periodic function. The input to the exp() function has to be negative; otherwise the fog calculation will not work. According to [Biri], K(x,y,z) is chosen as:

\[ K(x,y,z) = 1 + \frac{1}{2}\cos(5\pi y) + \frac{1}{5}\cos(7\pi(y + 0.1x)) + \frac{1}{5}\cos(5\pi(y - 0.1x)) + \frac{1}{10}\cos(\pi x)\cos\left(\frac{\pi z}{2}\right) \]
Using world coordinates as input to this function gives us a very nice varying fog distribution over the whole scene. This works quite well for a static camera but results in some problems for a moving one. The periodic distribution caused by the trigonometric functions is very apparent and easy to notice.
To make this technique more applicable to a moving camera, we use the calculated factor K(x,y,z) along with a normal exponential fog calculation. This leads to the following fog calculation:

\[ f = \exp(-d \cdot g) + K(x,y,z) \]

Here exp(-d·g) is the ordinary exponential fog term, which the shader computes from the scaled distance and the fog density.

Up until now, we have achieved a non-uniform fog density of the scene, which seamlessly integrates with exponential fog. Points near the camera are less foggy than distant ones, even when moving the camera.

Animating the fog is nothing more than adding a time-varying term to the density calculation. Depending on which term of K(x,y,z) you choose to modify with a varying factor, you can control the direction of the fog animation and therefore the direction of the simulated wind. You can also control the turbulence of the fog by choosing a constantly increasing animation term or by using a function to compute such a wind factor.
Implementation
The implementation only uses a vertex shader to compensate for the performance hit of making extensive use of trigonometric functions. The achieved results are normally quite sufficient and do not justify calculations for every fragment.
Shader
The complete vertex shader code follows:
VS_OUTPUT main(const VS_INPUT Input)
{
    float4 clpPos;

    // Init output
    VS_OUTPUT Out = (VS_OUTPUT) 0;

    // Calculate the clip space position
    clpPos = mul(Input.Position, matWorldViewProj);
    Out.Position = clpPos;

    // Simply pass on the texture coords and the vertex color
    Out.Tex0.xy = Input.Tex0.xy;
    Out.Diffuse = Input.Diffuse;

    // Get fog parameter
    float fAnim    = fFog.x;
    float fFogEnd  = fFog.y;
    float fDensity = fFog.z;

    // Calculate the distance. (Same as for exp-fog)
    // Camera space z coords scaled to have a value of 4 at distance FogEnd
    float4 camPos = mul(Input.Position, matWorldView);
    float fDist = camPos.z/fFogEnd*4;

    // Exp fog calculation
    float fExpFog = exp(-fDist*fDensity);

    // Animation is calculated based on world coordinates
    float4 worldPos = mul(Input.Position, matWorld);

    // Do the animation: -(1+0.5*cos(5*PI*z)+0.2*cos(7*PI*(z+0.1*x))
    //                     +0.2*cos(5*PI*(z-0.05*x))+0.1*cos(PI*x)*cos(PI*y/2))
    float k = -1 - 0.5*cos(5*3.14*worldPos.z+fAnim)
                 - 0.2*cos(7*3.14*(worldPos.z+0.1*worldPos.x))
                 - 0.2*cos(5*3.14*(worldPos.z-0.05*worldPos.x))
                 - 0.1*cos(3.14*worldPos.x)*cos(3.14*worldPos.y/2);

    // Final fog is addition of exp and animation
    float f = fExpFog + (camPos.z/fFogEnd)/4.0f*k;

    // Set the fog value
    Out.Fog = f;

    return Out;
}
Most of the shader code should be familiar by now, so let's have a look at the interesting parts. The shader uses the normal settings for fog end and density in its exponential fog calculation. The time-varying term fAnim is used to modify K(x,y,z) and therefore animates the fog density over time.
float fAnim    = fFog.x;
float fFogEnd  = fFog.y;
float fDensity = fFog.z;
The K(x,y,z) calculation is a direct translation of the equation discussed above into code. Notice that the time varying term fAnim is added to the first term, modifying the world space z coordinates. This achieves the effect of wind blowing along the world z-axis.
float k = -1 - 0.5*cos(5*3.14*worldPos.z+fAnim)
             - 0.2*cos(7*3.14*(worldPos.z+0.1*worldPos.x))
             - 0.2*cos(5*3.14*(worldPos.z-0.05*worldPos.x))
             - 0.1*cos(3.14*worldPos.x)*cos(3.14*worldPos.y/2);
To make the fog applicable for a moving camera, it is combined with an exponential fog factor. Notice that we again need to apply some application-dependent scaling to K(x,y,z) in order to achieve desirable results.
// Final fog is addition of exp and animation
float f = fExpFog + (camPos.z/fFogEnd)/4.0f*k;
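To get a feel for the value range this combination produces, the same arithmetic can be evaluated on the CPU. The C++ sketch below mirrors the shader listing; the function names periodicTerm and animatedFog are hypothetical, and kPi deliberately reuses the 3.14 literal from the shader rather than a more precise value.

```cpp
#include <cassert>
#include <cmath>

// Illustrative CPU mirror of the animated-fog vertex shader (names are hypothetical).
const float kPi = 3.14f;  // same literal as in the shader listing

// Negated periodic term k; 'anim' is the ever-increasing time value (fFog.x).
float periodicTerm(float x, float y, float z, float anim) {
    return -1.0f
         - 0.5f * std::cos(5.0f * kPi * z + anim)
         - 0.2f * std::cos(7.0f * kPi * (z + 0.1f * x))
         - 0.2f * std::cos(5.0f * kPi * (z - 0.05f * x))
         - 0.1f * std::cos(kPi * x) * std::cos(kPi * y / 2.0f);
}

// camZ is the camera-space depth; x, y, z are the vertex's world-space position.
float animatedFog(float camZ, float x, float y, float z,
                  float anim, float fogEnd, float density) {
    assert(fogEnd > 0.0f);
    const float dist = camZ / fogEnd * 4.0f;         // 4.0 at distance fogEnd
    const float expFog = std::exp(-dist * density);  // plain exponential fog
    const float k = periodicTerm(x, y, z, anim);
    return expFog + (camZ / fogEnd) / 4.0f * k;      // distance-scaled variation
}
```

Because every cosine lies in [-1, 1], k stays within [-2, 0], so the periodic term can only lower the exponential fog factor, and it does so more strongly with distance. The formula itself does not confine the sum to [0, 1].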
Application Settings
Besides the usual parameters, the shader is provided with a value for the fog density and fog end, used by the exp fog calculation. Additionally, a single float value is passed to the shader that is constantly increased from frame to frame in order to achieve the discussed animation.
Conclusion
With our discussion of basic environmental fog, simple volumetric fog, and more interesting animated fog, we have seen much of what can be accomplished by adding fog to an application. Each application will have different requirements and limitations regarding the usage of fog, but with the ground covered above, it should be easy to adapt and combine these techniques to achieve new effects. However, some closing notes are in order.

All of the pixel shaders used are very simple and do nothing more than a blend operation. Moving the fog calculation from the vertex shader to the pixel shader leads to more accurate results, but it must be decided on a case-by-case basis whether the additional computing cost (caused by doing the computation per fragment instead of per vertex) is worth it. Generally, you will not notice a big difference because fog mostly deals with uniform colors. As mentioned, this is a decision to be made with your specific application requirements in mind.

Another point is the usage of constant terms in the discussed shaders. To keep a direct connection to the equations that define a fog technique, most of the shaders calculate constant terms inside the shader itself. For a real-world application, this should not be the case. Precomputing constant terms on the CPU and providing them to the GPU once per frame is far more efficient than repeating the calculations for each execution of the shader program.

The last point that needs to be mentioned is the handling of fog color. All of our shaders use a constant fog color, but this is not a requirement. Doing additional calculations to determine the fog color for a vertex or fragment can result in some really interesting and exciting effects that can give your application the final touch.
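As an illustration of the closing note on constant terms, the divisions by fFogEnd and fFogRange in the layered-fog shader could be folded into reciprocals computed once on the CPU. The following C++ sketch assumes a hypothetical FogConstants layout packed into a single float4 constant register; neither the struct nor the function name comes from the chapter's sample code.

```cpp
#include <cassert>

// Hypothetical constant layout: names and packing are assumptions made for
// this sketch, not the layout used by the shaders in this chapter.
struct FogConstants {
    float invFogEndTimes2;    // 2 / fogEnd   : scales the x-z plane distance
    float invFogRangeTimes2;  // 2 / fogRange : scales height differences
    float fogTop;             // top of the fog layer
    float unused;             // padding to fill one float4 constant register
};

FogConstants makeFogConstants(float fogTop, float fogEnd) {
    assert(fogTop > 0.0f && fogEnd > 0.0f);
    const float fogRange = fogTop;  // ground fog starting at height zero
    return FogConstants{2.0f / fogEnd, 2.0f / fogRange, fogTop, 0.0f};
}
```

With such a layout, an expression like distance(...)/fFogEnd*2.0f reduces to a single multiplication in the shader. The four floats could then be uploaded once per frame with IDirect3DDevice9::SetVertexShaderConstantF, at whatever register index the shader declares for them.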
References
[Biri] Biri, V., S. Michelin, and D. Arquès, "Real-Time Animation of Realistic Fog," Thirteenth Eurographics Workshop on Rendering, 2002.
[Legakis] Legakis, J., "Fast Multi-Layer Fog," ACM SIGGRAPH 98 Conference Abstracts and Applications, p. 266.
[Rogers] Rogers, D., "Z-buffering, Interpolation and More W-buffering," nVidia Corp., www.nvidia.com/developer.
Introduction
Because we are using monitors to display information, the whole computer-generated world is turned into 2D before we see it. Then, only our brain (with great help from our memory and past experiences) can restore the feeling of three dimensions. Shadows are one of the most important guidelines in this process because they give us information about the position of objects in a scene. This chapter covers the implementation of the shadow map algorithm with Direct3D 9 and offers some improvements. First we fight depth compare errors by using a back-faced shadow map; we store depths only for pixels facing away from the light. We improve standard percentage closer filtering (PCF) by additional bilinear filtering performed in the pixel shader. The result is a highly optimized pixel shader that uses 64 arithmetic instructions to produce smooth shadows with reduced aliasing problems even for low-resolution maps.
Shadow Algorithm
The implementation of shadows presented here uses a well-known algorithm published in 1978 by Lance Williams in [1]. Here, we briefly describe each step of the original technique.

In the first step we render the image from the light's position and store the distance information for each pixel that is visible from the light's position into a texture called a shadow map. The original technique uses z-buffer information for the shadow map.

In the second step we render the image from the camera's position. We project the shadow map texture onto the geometry just like we did for the spotlight (see the "Advanced Lighting and Shading with Direct3D 9" article). For each rendered pixel we compute its actual distance from the light and compare this value with the value stored in the texture. If the actual distance is greater, the pixel is shadowed by something nearer and we skip lighting. Otherwise, the pixel is not occluded and we can illuminate it.

The algorithm is usable for directional lights or spotlights, but an extension for omnidirectional lights exists (see [3]). The advantage of this algorithm is that it is an image space algorithm and no knowledge of geometry is required. The main disadvantage is that it produces aliasing effects due to texture resolution. Before Direct3D 9, low precision of textures was one of the main problems in the implementation of this method in real time.
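The comparison in the second step can be modeled on the CPU with a few lines of C++. This is only a sketch of the test itself, assuming a tiny in-memory shadow map; the ShadowMap struct and the isLit function are hypothetical names, and the projection that maps a pixel to shadow map coordinates is left out.

```cpp
#include <cassert>
#include <vector>

// Minimal CPU model of a shadow map: for each texel, the distance from the
// light to the nearest surface seen in the first pass. Layout is illustrative.
struct ShadowMap {
    int width;
    int height;
    std::vector<float> depth;  // row-major, width * height entries

    float at(int x, int y) const { return depth[y * width + x]; }
};

// Second pass: a pixel is shadowed if it lies farther from the light than
// the surface recorded in the shadow map at the same texel.
bool isLit(const ShadowMap& map, int x, int y, float distanceFromLight) {
    assert(x >= 0 && x < map.width && y >= 0 && y < map.height);
    return !(distanceFromLight > map.at(x, y));
}
```

A pixel whose distance exactly equals the stored value counts as lit here, following the "greater than" wording of the text. With limited precision this borderline case is where depth comparison errors show up.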
The implementation presented here does not use the z-buffer as the shadow map (although the z-buffer is still enabled while creating the shadow map); instead, we render to a 32-bit floating-point texture. The distance is computed in world space, and everything between the spotlight's near and far clipping planes (in our case, derived from the 3D Studio Max spotlight properties called Attenuation Near - Start and Attenuation Far - End) is mapped linearly into the range [0...1] with the following equation:

\[ d = \frac{Dst - ZNear}{ZFar - ZNear} \]

This solution does not provide better quality (compared to plain usage of the z-buffer as the shadow map), but it is easier for prototyping, as we can easily test how the precision of the texture affects the algorithm.
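The linear mapping is small enough to check in isolation. The C++ sketch below uses a hypothetical function name; the clamp to [0, 1] is an addition for this example and is not described in the text.

```cpp
#include <cassert>
#include <cmath>

// Maps a world-space distance from the light linearly into [0, 1] between the
// spotlight's near and far planes. The final clamp is an assumption of this sketch.
float normalizeLightDistance(float dist, float zNear, float zFar) {
    assert(zFar > zNear);
    const float invRange = 1.0f / (zFar - zNear);  // can be precomputed on the CPU
    const float t = (dist - zNear) * invRange;
    return std::fmin(1.0f, std::fmax(0.0f, t));
}
```

The vertex shader listing later in this article multiplies by a constant instead of dividing, which suggests that the reciprocal of the range is supplied as a precomputed shader constant.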
distance. This minimizes depth comparison errors but finding a good bias for the whole scene is not a trivial task. The third image in Figure 3 shows what happens if the depth bias is too large.
Figure 3: Depth bias issues a) no depth bias b) just right depth bias c) too high bias
This implementation uses a different approach to fight depth test issues. Instead of finding a bias value, the shadow map is rendered with front-face culling enabled. Rendering only back faces (those facing away from the light source) causes lit faces to always pass the depth test. The depth bias issue can be ignored for back faces because these are already shadowed by the lighting and shading algorithm. Cases where one piece of geometry is really supposed to be shadowed by another are handled correctly unless the occluder and the occluded pixels are too close (but in this case, we cannot see a shadowed pixel anyway).

This is not the most robust solution, but it works well for scenes with 2-manifold (or almost 2-manifold, like the teapot in our case) objects where the distance between front-facing and back-facing faces is not very small (with respect to the precision of the shadow map). Game geometry in most cases satisfies this need, and if not, it can be tuned during the authoring process.

The solution can be easily improved if we add depth bias selectively only for objects that do not meet the criteria. For this geometry, face culling is disabled and the depth bias value is added. This does not interfere with our implementation. The second image in Figure 4 shows that back-faced depth maps with only 8-bit precision produce very good results.
Figure 4: Back-facing depth map a) back-facing depth map b) scene rendered using back-faced depth maps that have 8-bit precision
Until now, PCF was not possible in real time (or only in a very limited form). With the power of Direct3D 9, we can perform it in the pixel shader. We implement a slightly improved version of PCF to gain even smoother results. After we obtain the binary result in the second part of the PCF for the 3x3 region, we use bilinear filtering on it to get four newly filtered values. For this, we use shadow map coordinates relative to the nearest top-left texel corner. First we perform a linear interpolation between columns of the kernel, and we get a temporary 2x3 kernel. Then we perform linear interpolation on rows and get four filtered values. In the last step, we average the filtered values to get the final attenuation. Figure 6 illustrates the improved PCF, and Figure 7 illustrates bilinear filtering. Figure 8 shows the quality comparison.
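The filtering steps described above can be sketched on the CPU as follows. This C++ example is an illustration under stated assumptions rather than a port of the pixel shader: the type and function names are hypothetical, lit texels are stored as 1 and the average is taken at the end, and the kernel is indexed as [row][column].

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Kernel3x3 = std::array<std::array<float, 3>, 3>;  // indexed [row][column]

static float mix(float a, float b, float t) { return a + t * (b - a); }

// Binary depth test per texel of the 3x3 region (1 = lit, 0 = shadowed).
Kernel3x3 compareKernel(const Kernel3x3& storedDepth, float pixelDepth) {
    Kernel3x3 lit{};
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            lit[r][c] = (pixelDepth - storedDepth[r][c] < 0.0f) ? 1.0f : 0.0f;
    return lit;
}

// Bilinear filtering of the binary kernel. fracX and fracY are the shadow map
// coordinates relative to the texel's top-left corner, each in [0, 1).
float filterKernel(const Kernel3x3& lit, float fracX, float fracY) {
    assert(fracX >= 0.0f && fracX < 1.0f && fracY >= 0.0f && fracY < 1.0f);
    float sum = 0.0f;
    for (int c = 0; c < 2; ++c) {
        // Interpolate between neighboring columns: 3x3 becomes 2 columns of 3 rows.
        float col[3];
        for (int r = 0; r < 3; ++r)
            col[r] = mix(lit[r][c], lit[r][c + 1], fracX);
        // Interpolate between neighboring rows: each column yields two values.
        sum += mix(col[0], col[1], fracY);
        sum += mix(col[1], col[2], fracY);
    }
    return sum * 0.25f;  // average of the four filtered values
}
```

As a usage example, if only the left column of the kernel is shadowed, the result is 0.5 at fracX = 0 and rises to 0.75 at fracX = 0.5, so the attenuation changes gradually as the sample position moves across a texel instead of jumping between discrete levels.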
// Output
//------------------------------
// oT0 - texture coordinates
// oT1 - Light vector (in tangent space)
// oT2 - eye vector (in tangent space)
// oT3 - projective spotlight texture coordinates
// oT4 - distance from light

//The following code outputs position and texture coordinates
//------------------------------
m4x4 oPos, v0, c4        //vertex clip position
mov  oT0.xy, v2.xy       //Texture coordinates for color texture
m4x4 r8, v0, c0          //Transform vertex into world position

//The following code generates tangent space base vectors
//------------------------------
m3x3 r11.xyz, v1, c0     //transform normal N to world space
mov  r11.w, v1.w
m3x3 r9.xyz, v3, c0      //transform tangent T to world space
mov  r9.w, v3.w
crs  r10.xyz, r9, r11    //The cross product to compute binormal NxT

//Computes light and eye vectors and projector's texture coordinates
//------------------------------
sub  r0, c8, r8          //Build the light vector from light source to vertex
nrm  r1, r0              //normalize vector
m3x3 oT1.xyz, r1, r9     //transform vector with N, T, NxT into tangent space
sub  r0, c9, r8          //build the eye vector from vertex to camera source
nrm  r1, r0              //normalize vector
m3x3 oT2.xyz, r1, r9     //transform vector with N, T, NxT into tangent space
m4x4 oT3.xyzw, v0, c10   //compute projector texture coordinates

//Compute distance from light and normalize it to [0...1]
//------------------------------
sub  r0, c8, r8          //Build the light vector from light source to vertex
dp3  r1.x, r0, r0        //length of vector^2
pow  r2.x, r1.x, c14.w   //sqrt(length^2)
sub  r3.x, r2.x, c14.x   //Dst - ZNear
mul  r4.x, r3.x, c14.z   //(Dst - ZNear)/(ZFar - ZNear) - normalized position
mov  oT4, r4.x           //Output it
r6, c8, r0 r5, c9, r0 r4, c10, r0 r3, c11, r0 //Right column
// 4 - Fill 3x3 filtering kernel
//------------------------------
texld r10, r10, s2               //Left column
texld r9, r9, s2
texld r8, r8, s2
texld r7, r7, s2                 //Center column
texld r6, r6, s2
texld r5, r5, s2                 //Right column
texld r4, r4, s2
texld r3, r3, s2

// 5 - Distance comparison we get 3x3 binary kernel
//------------------------------
sub r10.x, t4.x, r10.x           //Left column
sub r10.y, t4.x, r9.x
sub r10.z, t4.x, r8.x
cmp r1.xyz, r10, c30.g, c30.r    //distance comparison
sub r9.x, t4.x, r7.x             //Center column
sub r9.y, t4.x, r11.x
sub r9.z, t4.x, r6.x
cmp r2.xyz, r9, c30.g, c30.r     //distance comparison
sub r8.x, t4.x, r5.x             //Right column
sub r8.y, t4.x, r4.x
sub r8.z, t4.x, r3.x
cmp r3.xyz, r8, c30.g, c30.r     //distance comparison

// 6 - Bilinear filtering of 3x3 binary kernel
//------------------------------
mul r0, r0, c29                  //get coordinate in texture
frc r0, r0                       //get fractional part only
lrp r10.xyz, r0.x, r2, r1        //interpolate column 1 and 2
lrp r11.xyz, r0.x, r3, r2        //interpolate column 2 and 3
                                 //accumulate to get average

// 7 - Setup needed vectors - load and normalize
//------------------------------
texld r0, t0, s1                 //load normal
mad r1, r0, c31.r, -c31.g        //bias normal to range -1,1
nrm r11, r1                      //r11 = normalized normal
mov r1.xyz, t1
nrm r10, r1                      //r10 = normalized light vector
mov r1.xyz, t2
nrm r9, r1                       //r9 = normalized eye vector

// 8 - Compute diffuse and specular intensities
//------------------------------
dp3 r0, r11, r10                 //r0 = (n.l)
mul r1, r0, c31.r                //r1.g = 2*(n.l)
mad r1, r1, r11, -r10            //compute reflectance vector - r1 = 2(n.l)n - l
dp3_sat r1, r1, r9               //r1 = (r.v)
pow r0.g, r1.r, c2.r             //r1 = (r.v)^shi
cmp r0, r0.r, r0, c31.b          //if (n.l)<0 do not lit anything

// 9 - Modulate texture with computed intensities
//------------------------------
texld r6, t0, s0                 //load diffuse texture (gloss map is in alpha)
texldp r4, t3, s3                //load projector texture (perspective correct)
mul r2, r6.a, r0.g               //modulate specular intensity with gloss map
mul r2, r2, c1                   // and with material's specular and light color
mul r3, r6, r0.r                 //modulate diffuse intensity with texture
mad r0, r3, c0, r2               // and with material's diffuse and light color
                                 //and add specular
mul r0, r0, r4                   //modulate it with spotlight texture
mul r0, r0, r8.x                 // and with shadow
mov oC0, r0                      //color output
In the second part of the shader we compute the shadow map texture coordinates. These were computed in the vertex shader, and here we have to perform a perspective-correct texture lookup. We have to adjust the coordinates to point exactly at the texel center to have correct shadows, and we use these in the next section to get the additional eight coordinates for a complete kernel. Because of this, we do a manual division by w and then use texld instead of a simple texldp. The mad instruction does this computation by multiplication with 1/w and addition of 0.5/shadow_map_size for texel center adjustment. We compute coordinates for the remaining filtering kernel in the third part and load the distance information from the shadow map in the next one.

In the fifth section we compare the distance stored in the texture with the actual one. We subtract the stored depth value from the actual one and store the result for each column into the components of one register. Then we compare these values to 0 with one cmp instruction. If the value is less than 0, the actual pixel is not shadowed, and we remember a value of 0.25. Otherwise, we store 0. Note that 0.25 is stored here because in the final step we perform an average of four values.

In the beginning of the sixth section, we find the coordinates relative to the top-left corner of the shadow map texel. To do this, we multiply the coordinates by the dimensions of the texture and store only the fractional part. Then we perform bilinear interpolation. With the first two lrp instructions, we can interpolate whole columns with respect to the relative x coordinate. We then interpolate the final values from individual rows of these columns. The last instruction, dp4, performs a four-component dot product. Because we stored the interpolated values in one register and the other register contains the vector (1,1,1,1), this dot product actually performs the sum of four values with one instruction. Since these values were already divided by four, this sum is their average, which is the final light attenuation by shadow.

The rest of the shader performs the per-pixel Phong lighting shown in the "Advanced Lighting and Shading with Direct3D 9" article, and the only change is the modulation of the lighting result with the shadow.
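The coordinate setup from the second and third parts can be sketched in C++ as follows. The names are hypothetical, and the neighbor spacing of exactly one texel (1/shadow_map_size) is an assumption of this sketch, since the text above does not spell out the offsets.

```cpp
#include <array>
#include <cassert>

struct TexCoord { float u, v; };

// Manual perspective divide followed by a half-texel shift, mirroring the
// "multiply by 1/w, add 0.5/shadow_map_size" step described in the text.
TexCoord projectToTexelCenter(float s, float t, float w, float shadowMapSize) {
    assert(w != 0.0f && shadowMapSize > 0.0f);
    const float invW = 1.0f / w;
    const float halfTexel = 0.5f / shadowMapSize;
    return TexCoord{s * invW + halfTexel, t * invW + halfTexel};
}

// Nine lookup positions: the center plus its eight neighbors, assuming the
// neighbors are one texel apart (an assumption of this sketch).
std::array<TexCoord, 9> kernelCoords(TexCoord center, float shadowMapSize) {
    const float texel = 1.0f / shadowMapSize;
    std::array<TexCoord, 9> coords{};
    int i = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            coords[i++] = TexCoord{center.u + dx * texel, center.v + dy * texel};
    return coords;
}
```

Doing the divide manually is what makes the second function possible: the neighbor offsets have to be added after the division by w, which a projective lookup such as texldp would otherwise perform internally.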
Conclusion
With the possibilities of Direct3D 9, we altered a classic shadow map algorithm so that it produces soft-edged shadows and minimizes depth bias issues. Bilinear filtering of the 3x3 binary kernel from the percentage closer filtering results in very soft shadows even for a low-resolution shadow map. Usage of the back-faced shadow map minimizes the depth compare errors and allows us to use lower-precision depth maps. This is vital for the Direct3D 8 class of hardware, where two channels of 8-bit textures are used to encode depth information.
References
[1] Williams, L., "Casting Curved Shadows on Curved Surfaces," Computer Graphics (SIGGRAPH 78 proceedings), August 1978, Vol. 12:3, pp. 270-274.
[2] Reeves, W.T., D.H. Salesin, and R.L. Cook, "Rendering Antialiased Shadows with Depth Maps," Computer Graphics (SIGGRAPH 87 proceedings), July 1987, pp. 283-291.
[3] Brabec, S., T. Annen, and H.-P. Seidel, "Shadow Mapping for Hemispherical and Omnidirectional Light Sources," Advances in Modelling, Animation and Rendering, J. Vince, R. Earnshaw, eds., Springer: London, 2002, pp. 397-408.
Introduction
One of the best visual improvements that we can make to a rendered scene is to add shadows. Shadows greatly enhance the realism of a rendered scene and provide viewers with important visual cues about object placement within the scene. Rendering cost, memory constraints, or hardware limitations sometimes make the generation of accurate shadows infeasible. However, it is often better to have at least some form of shadow, albeit a rough approximation. This is why some older games use patches of dark circular textures projected onto the surface that the game characters stand on to approximate shadowing by the characters. These types of hacks are no longer acceptable to today's gamers, who have ever-increasing expectations.

Allan Watt [13] discussed four major approaches to shadow generation: polygon projection with scan line testing, shadow polygon through visible surface, shadow volume, and shadow z-buffer. Of the four, only the shadow volume and shadow z-buffer approaches are still commonly employed today. This article concerns the shadow volume approach, which is fast becoming a fixture in newer games.
Although not a clear-cut winner, shadow volume implementation does provide several advantages over other shadowing techniques. It provides accurate hard shadows, and occluder self-shadowing is inherent in the technique. For a scene full of shadow-casting occluders, shadow volume also provides accurate inter-occluder shadowing.

Shadow volume is also fast gaining popularity with professional game and graphics developers. The extensive use of stencil shadow volumes in John Carmack's new Doom III engine and the impressive Power Render X game engine [4] are most notable. With this rising popularity comes a wave of enthusiastic hardware support from major graphics hardware vendors. Industry powerhouse nVidia, for example, has specifically added new capabilities to provide hardware support for shadow volume implementations in the GeForce family of consumer graphics cards. ATI Technologies Inc. has also included accelerated shadow volume rendering capabilities in its SMARTSHADER 2.0 technology that comes with graphics cards such as the highly successful Radeon 9700. It is also possible to combine shadow volume implementation with other rendering techniques, such as projective texturing, volume texturing, or shadow mapping, to achieve highly realistic soft shadows or distance-attenuated shadowing.

This article covers both the theoretical and practical aspects of stencil shadow volumes. For readers already well versed in the theory, the "Implementation on CPU" and "Implementation on GPU (Shaders)" sections provide details on implementation utilizing the CPU and GPU. For those unfamiliar with the shadow volume methodology, the "Shadow Volume Concept" and "Problems and Solutions" sections provide detailed discussions on the theories and algorithms.
Figure 1 shows a light source, an occluder, and a shadow receiver. The shaded region depicts the shadow volume generated by the occluder. We work on the basis that our light sources are attenuated omnidirectional point lights. This assumption is actually an added advantage of the stencil shadow volume technique, because generating shadows for omnidirectional light sources using view-dependent techniques such as shadow mapping or projective texturing is tricky, if not inefficient, on modern hardware. View-dependent techniques are very good for generating shadows created by directional light sources such as torchlight. However, the stencil shadow volume is more flexible and can be trivially altered to work for directional light sources.

All the occluders that we work with are also assumed to be solid polygonal objects with no transparency or alpha that distorts the shadows generated. There is an added requirement that the occluders be made up of meshes that are closed volumes; we discuss this requirement in more detail in the "Silhouette Determination" section. Lastly, we ignore the shadow volume projections and consequently the shadows of the shadow receivers.

You probably noticed that in Figure 1 the shadow volume is supposed to extend to infinity. This is how the name "infinite shadow volumes" came about. Infinite shadow volumes help to solve a problem known as finite shadow cover, which we discuss in the "Finite Shadow Cover" section. The implementation of the infinite shadow volume is presented in the "Vertex Shader Implementation (InfiniteGPU)" section.
Figure 2 shows the silhouette of a sphere with respect to the light source. In essence, silhouettes are simply the outlines of occluders as seen from the position of the light sources. The shadow volume of an occluder is formed when we extrude the silhouette by a certain distance, finite or infinite, in the direction of the incident light rays originating from the light source. Using triangles as primitives in our meshes, a silhouette is simply made up of a chain of edges that consist of two vertices each.

It should be noted at this point that shadow volume extrusion differs for different light sources. For point light sources (as depicted in Figure 2), the silhouette edges extrude exactly point for point. For infinite directional light sources, the silhouette edges extrude to a single point at infinity. We go into the details of determining silhouette edges and the creation of the shadow volumes in the "Implementation on CPU" and "Implementation on GPU (Shaders)" sections. The magnitude of the extrusion can be either finite or infinite.

There are two techniques for implementing stencil shadow volumes. The original technique is known as depth-pass, while the other, a newer variant, is known as depth-fail. Let's look at how these two techniques differ in concept and implementation before we go into the problems that plague both of them.
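Returning briefly to the extrusion step, a finite extrusion for a point light can be sketched in a few lines of C++. The names Vec3 and extrudeFromPointLight are hypothetical; the actual CPU and GPU implementations are covered in the later sections and may differ.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Point light: push a silhouette vertex away from the light along the ray
// that runs from the light position through the vertex.
Vec3 extrudeFromPointLight(Vec3 vertex, Vec3 lightPos, float distance) {
    const Vec3 d{vertex.x - lightPos.x, vertex.y - lightPos.y, vertex.z - lightPos.z};
    const float len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    assert(len > 0.0f);  // the vertex must not coincide with the light
    const float s = distance / len;
    return Vec3{vertex.x + d.x * s, vertex.y + d.y * s, vertex.z + d.z * s};
}
```

Because each vertex has its own ray from the light, two silhouette vertices extrude in different directions and the volume widens with distance. For a directional light all vertices would share a single extrusion direction instead.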
Depth-pass (z-pass)
Figure 3 shows the numerous possible viewing directions of a player in the scene. The numbers at the end of the arrows are the values left in the stencil buffer after rendering the shadow volume. Fragments with non-zero stencil values are considered to be in shadow. The generation of the values in the stencil buffer is the result of the following stencil operations:

1. Render front faces. Increment the stencil value when the depth test passes. Do nothing when the depth test fails. Disable drawing to the frame and depth buffers.
2. Render back faces. Decrement the stencil value when the depth test passes. Do nothing when the depth test fails. Disable drawing to the frame and depth buffers.

The above algorithm is known as the depth-pass stencil shadow volume technique, since we manipulate the stencil values only when the depth test passes. Depth-pass is also commonly known as z-pass.

Let's assume that we had already rendered the objects onto the frame and depth buffer prior to the above stenciling operations. This means that the depth buffer would have been set with the correct values for depth testing (or z-testing if you like). The two leftmost rays originating from the eye position do not hit any part of the shadow volume (in gray), hence the resultant stencil values are 0, which means that the fragments represented by these two rays are not in shadow.

Now let's trace the third ray from the left. When we render the front face of the shadow volume, the depth test would pass and the stencil value would be incremented to 1. When we render the back face of the shadow volume, the depth test would fail since the back face of the shadow volume is behind the occluder. Thus, the stencil value for the fragment represented by this ray remains at 1. This means that the fragment is in shadow since its stencil value is non-zero. To be convinced of the viability of the technique, the reader should inspect the derivation of the stencil values for the remaining two rays.
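The two stencil passes can be simulated for a single pixel to check the reasoning above. The C++ sketch below is a simplified model, not Direct3D code: it treats the view ray as a list of shadow volume faces with known depths, and all names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// One shadow volume polygon crossed by the view ray of a pixel.
struct VolumeFace {
    float depth;     // distance from the eye along the ray
    bool frontFace;  // true if the polygon faces the viewer
};

// Simulates the two depth-pass stencil passes for one pixel. 'sceneDepth' is
// the value already stored in the depth buffer from rendering the scene.
int depthPassStencil(const std::vector<VolumeFace>& faces, float sceneDepth) {
    int stencil = 0;
    for (const VolumeFace& f : faces)  // pass 1: front faces increment
        if (f.frontFace && f.depth < sceneDepth) ++stencil;
    for (const VolumeFace& f : faces)  // pass 2: back faces decrement
        if (!f.frontFace && f.depth < sceneDepth) --stencil;
    return stencil;
}

bool inShadow(const std::vector<VolumeFace>& faces, float sceneDepth) {
    return depthPassStencil(faces, sceneDepth) != 0;
}
```

For a volume whose front face lies at depth 2 and whose back face lies at depth 6, a fragment at depth 4 ends up with a stencil value of 1 (in shadow), while fragments at depth 1 or depth 8 end up with 0. The model uses a signed integer for simplicity, whereas a hardware stencil buffer has a limited unsigned range.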
While going through the algorithm of the depth-pass technique, we are effectively doing per-pixel shadow volume counting to determine the number of times a ray representing a pixel enters and leaves the shadow volumes of the occluders. Not surprisingly, it is the same concept employed in ray-tracing techniques, whereby rays are projected to calculate the color values that represent on-screen pixels. In the case of stencil shadow volumes, we are only interested in whether a pixel is in shadow or not. Does shadow volume counting work for multiple overlapping shadow volumes?
Even when the shadow volumes are overlapping, as shown in Figure 4, shadow volume counting using the stencil buffer will still work. Any point on a geometric surface in a scene can only exist in three positions with respect to the shadow volumes. The point can be in front of the shadow volumes, behind the shadow volumes, or nested within the shadow volumes. For the first case whereby the point is in front of the shadow volumes, shadow volume counting gives a result of 0, as all depth tests fail, indicating that the point is not in shadow. The ray on the left in Figure 4 illustrates the second case, whereby the point is behind the shadow volume and thus not in shadow with a stencil value of 0.
The ray on the right illustrates the third case, whereby the point is nested within the shadow volumes and thus in shadow with a non-zero stencil value. The ingenuity of counting shadow volumes is that self-shadowing and inter-occluder shadowing are totally embedded into the algorithm! John Carmack [5] and the team of Bill Bilodeau and Mike Songy [6] independently presented an alternative technique that is the direct reverse of the depth-pass stencil algorithm. Consequently, this alternative technique was aptly named the depth-fail technique. The depth-fail technique is also commonly known in the developer community as Carmack's Reverse. Why did John Carmack, Bill Bilodeau, and Mike Songy even bother to come up with an alternative stencil algorithm when the depth-pass technique seems to work so well? Depth-pass works flawlessly, at least most of the time. However, when the eye point enters the shadow volume, the depth-pass algorithm fails utterly.
Figure 5: Depth-pass stencil operations fail when the eye point is within the shadow volume.
When the eye point is within the shadow volume, the front face of the shadow volume does not get rendered at all. This disrupts the shadow volume counting and results in erroneous values left in
the stencil buffer. Figure 5 illustrates two cases in which the depth-pass stencil operations produce the wrong shadow volume count. The ray on the left should result in a non-zero stencil value, while the ray on the right should result in a stencil value of 0. Let's look into the mechanics of the depth-fail algorithm that allow it to handle this situation properly.
Depth-fail (z-fail)
When the eye point enters a shadow volume, the front faces of the shadow volume are clipped away by the near plane of the view frustum. This clipping is the culprit that disrupts the depth-pass shadow volume counting. To account for this clipping, we can perform the following extended stencil operations, which are derived from the depth-pass algorithm:
1. Render back faces. Increment stencil value for both depth-pass and depth-fail (effectively disabling depth test).
2. Render front faces. Decrement stencil value for both depth-pass and depth-fail (effectively disabling depth test).
3. Render back faces. Decrement stencil value for depth-pass. Do nothing for depth-fail.
4. Render front faces. Increment stencil value for depth-pass. Do nothing for depth-fail.
The purpose of the first two steps in the above stenciling operation is to leave positive values in the stencil buffer when the eye point is inside the shadow volume, thus accounting for the clipping of the front faces of the shadow volume. The third and fourth steps are actually the original depth-pass algorithm with the ordering reversed. Rearranging the steps, we get:
1. Render back faces. Increment stencil value for both depth-pass and depth-fail (effectively disabling depth test).
2. Render back faces. Decrement stencil value for depth-pass. Do nothing for depth-fail.
3. Render front faces. Decrement stencil value for both depth-pass and depth-fail (effectively disabling depth test).
4. Render front faces. Increment stencil value for depth-pass. Do nothing for depth-fail.
It is now obvious that some of the above steps cancel each other out: for the back faces, the increment and decrement on depth-pass cancel, leaving only an increment on depth-fail, and likewise for the front faces, leaving only a decrement on depth-fail. Simplifying the above stenciling operations, we get:
1. Render back faces. Increment stencil value for depth-fail. Do nothing for depth-pass. Disable draw to frame and depth buffer.
2. Render front faces. Decrement stencil value for depth-fail. Do nothing for depth-pass. Disable draw to frame and depth buffer.
The two-step stenciling operation above is the complete depth-fail algorithm. It is known as the depth-fail stencil shadow volume technique since we manipulate the stencil values only when the depth test fails. Depth-fail is also commonly known as z-fail. The depth-fail algorithm is really just the opposite of the depth-pass algorithm. The depth-fail stencil operations, however, do not falter when the eye point is in the shadow volume:
Figure 6: Depth-fail works even when the eye point is within the shadow volume.
Figure 6 again depicts the situation in which the eye point is within a shadow volume. However, by implementing the depth-fail stencil operations, the resultant values in the stencil buffer are correct. Figure 7 shows that the depth-fail algorithm would work for normal situations in which the eye point is outside the shadow volumes. The reader should inspect other possible scenarios to convince himself of the viability of the depth-fail algorithm.
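To make the comparison between the two techniques concrete, here is a small sketch in plain C++ that counts along a single ray under both rules. It is an illustration under stated assumptions rather than code from the samples: `VolumeFace`, `IsRendered`, `DepthPassStencil`, and `DepthFailStencil` are hypothetical names, faces nearer than the near plane are treated as clipped away, and the depth test is a simple less-than comparison.

```cpp
#include <vector>

// Hypothetical record: one crossing of the eye ray with a shadow volume face.
struct VolumeFace {
    float depth;        // distance from the eye along the ray (negative = behind the eye)
    bool  frontFacing;  // true if the face points toward the eye
};

// Faces in front of the near plane are clipped and never reach the stencil test.
static bool IsRendered(const VolumeFace& face, float nearPlane)
{
    return face.depth >= nearPlane;
}

// Depth-pass: front faces increment and back faces decrement when the depth test passes.
int DepthPassStencil(const std::vector<VolumeFace>& faces, float sceneDepth, float nearPlane)
{
    int stencil = 0;
    for (const VolumeFace& face : faces)
        if (IsRendered(face, nearPlane) && face.depth < sceneDepth)
            stencil += face.frontFacing ? 1 : -1;
    return stencil;
}

// Depth-fail: back faces increment and front faces decrement when the depth test fails.
int DepthFailStencil(const std::vector<VolumeFace>& faces, float sceneDepth, float nearPlane)
{
    int stencil = 0;
    for (const VolumeFace& face : faces)
        if (IsRendered(face, nearPlane) && face.depth >= sceneDepth)
            stencil += face.frontFacing ? -1 : 1;
    return stencil;
}
```

With the eye inside a volume (front face behind the eye at depth -1, back face at depth 6), a surface at depth 4 is in shadow: the depth-pass rule leaves 0, whereas the depth-fail rule leaves 1. A lit surface at depth 8 gets -1 from depth-pass and the correct 0 from depth-fail.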
To put non-zero values into the stencil buffer, the depth-fail technique depends on the failure to render the shadow volume's back faces with respect to the eye position. This means that the shadow volume must be a closed volume; the shadow volume must be capped at both the front and back ends (even if the back end is at infinity). Without capping, the depth-fail technique would produce erroneous results. Amazing as it may sound, you can even cap the shadow volume at infinity.
As shown in Figure 8, the front and back capping (bold lines) create a closed shadow volume. Both the front and back capping are considered back faces from the two eye positions. With depth-fail stenciling operations, the capping will create correct non-zero stencil values. There are a few ways to create the front and back capping. Mark Kilgard [7] described a non-trivial method of creating the front capping. The method basically involves projecting the occluder's back-facing geometries onto the near clip plane and using these geometries as the front capping. Alternatively, we can build the front capping by reusing the front-facing triangles with respect to the light source. The geometries used in the front capping can then be extruded, with their ordering reversed, to create the back capping. Reversing the ordering ensures that the back capping faces outward from the shadow volume. In fact, we must always ensure that the primitives (in our case, triangles) that define the entire shadow volume are outward facing, as shown in Figure 9. It must be noted that rendering closed shadow volumes is somewhat more expensive than using depth-pass without shadow volume capping. Besides a larger primitive count for the shadow volume due to the capping geometries, additional resources are needed to compute and store the front and back capping. We go into the details of capping shadow volumes, including some possible optimization techniques, in the Shadow Volume Capping section.
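The second capping approach, reusing the light-facing triangles and extruding them with reversed ordering, can be sketched as follows. This is an illustrative sketch only: `Vec3`, `Triangle`, `ExtrudeFromLight`, and `BuildCaps` are hypothetical names, a point light and a finite extrusion distance are assumed, and no vertex is assumed to coincide with the light position.

```cpp
#include <array>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
using Triangle = std::array<Vec3, 3>;

// Moves v away from the light by an absolute distance (assumes v != light).
static Vec3 ExtrudeFromLight(const Vec3& v, const Vec3& light, float distance)
{
    const Vec3 d{ v.x - light.x, v.y - light.y, v.z - light.z };
    const float len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    const float s = distance / len;
    return Vec3{ v.x + d.x * s, v.y + d.y * s, v.z + d.z * s };
}

// Front cap: the light-facing triangles, reused unchanged.
// Back cap: the same triangles extruded, with vertex order reversed so that
// the cap faces outward from the shadow volume.
void BuildCaps(const std::vector<Triangle>& lightFacing, const Vec3& light, float distance,
               std::vector<Triangle>& frontCap, std::vector<Triangle>& backCap)
{
    frontCap = lightFacing;
    backCap.clear();
    for (const Triangle& t : lightFacing) {
        Triangle cap = { ExtrudeFromLight(t[2], light, distance),
                         ExtrudeFromLight(t[1], light, distance),
                         ExtrudeFromLight(t[0], light, distance) };
        backCap.push_back(cap);
    }
}
```

The reversal of the vertex order is the only difference between the two caps besides the extrusion itself; the same source triangles feed both.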
Figure 10: A finite shadow volume may fail to cover all objects adequately.
Finite shadow volume cover affects both depth-pass and depth-fail implementations, but let's assume the case of a depth-fail implementation in Figure 10. With the light close to object A, a finite shadow volume may not be extended far enough to cover object B properly. The ray from the eye toward object B ends up with a fragment stencil value of 0 when in fact it should be non-zero! An infinite shadow volume would ensure that no matter how close the light is to an occluder, the resultant shadow volume would cover all the objects in the scene. We discuss how to create an infinite shadow volume by extruding silhouette vertices to infinity using homogeneous coordinates in the Forming the Shadow Volume and Vertex Shader Implementation (InfiniteGPU) sections.
Ghost Shadow
While extruding geometries by a huge distance or to infinity helps avoid the problem of finite shadow volume cover, it also generates another problem: Imagine two players in a dungeon first-person shooter (FPS) game, roaming in adjacent rooms separated by a solid brick wall. A table lamp in one of the rooms causes one of the players to cast a shadow onto the brick wall separating the rooms.
The player in the other room would see this shadow since the shadow volume extrudes out to infinity. The solid brick wall suddenly feels like a thin piece of paper with a ghost shadow on it. Fortunately, by utilizing occlusion information and other culling techniques, we can restrict shadow volume rendering to individual rooms and avoid this kind of situation. Figure 11 shows a more awkward situation, whereby the camera sees both the occluder and its shadow on one side and the ghost shadow on the other side of the terrain. Handling such a situation is tricky because the shadow volume must not be extruded beyond the terrain. Determination of the correct extrusion distance is not trivial, especially if the light and occluder are free to move around the scene. This scenario is very possible, especially for flight simulations or aerial combat games.
The only feasible solution to avoid both the finite shadow volume cover (Figure 10) and ghost shadow (Figure 11) is to impose limitations on the placement of light sources and occluders in a scene. If we can be sure that an occluder can never get closer than a certain distance to a shadow-casting light source, we can safely estimate the largest distance we would need to extrude the shadow volume in order to provide adequate shadow cover while not causing ghost shadows. It is thus an added responsibility of level designers to ensure that the occluder and light source
placement in a scene do not compromise or break the underlying stencil shadow volume implementation. We discuss the importance of scene management in more detail in the Efficiency and Robustness section.
Figure 12: Shadow volume clipped at near clip plane causing depth-pass errors
proper depth testing. The ray from the eye in Figure 13 represents a case whereby the depth-fail technique generates errors since the far plane had clipped the back face of the shadow volume. The clipping of the back faces destroys the chance for the depth-fail algorithm to increment the stencil buffer and thus results in incorrect shadowing.
Figure 13: Shadow volume clipped at far clip plane, causing depth-fail errors
A simple solution to these problems is to move the clipping planes to avoid clipping the shadow volume. Adjusting the near plane to avoid the depth-pass problem is not feasible because doing so greatly affects the depth precision range and may have negative impacts on other operations that depend on the depth buffer values. Shifting the far plane out to an infinite distance, on the other hand, actually solves the far plane clipping problem for depth-fail. Let's first take a look at attempts to solve the depth-pass near clip plane problem, which happens to be one of the trickiest issues that one could encounter in real-time graphics. Mark Kilgard [7] presented interesting ideas on how to handle the two possible scenarios when shadow volumes intersect the near clip plane. The idea was to cap the shadow volume at the near clip plane so that the previously clipped front-facing geometries could now be rendered at the near clip plane instead. The first scenario is when all the vertices of the occluder's silhouette project to the near clip plane. In this case, a quad strip loop is generated from all front-facing vertices within the silhouette of the occluder. The quad strip loop is then projected onto the near clip plane, thus forming a capping for the shadow volume. The second scenario occurs when only part of the shadow volume projects onto the near clip plane. This proved to be much more difficult to handle than the previous scenario. To his credit, Kilgard devised an elaborate system to filter out the vertices of triangles (facing away from the light) that should be projected onto the near clip plane in order to cap the shadow volume. The capping of shadow volumes at the near clip plane gave rise to another problem: depth precision. Rendering geometries at the near clip plane is analogous to rolling a coin on its edge; the coin can fall to either side easily and unpredictably.
This means that the near plane may still clip the vertices that were meant to cap the shadow volume. To overcome this, Kilgard devised yet another method that builds a depth range ledge from the eye point to the near plane. The idea is to render the shadow volume from a depth range of [0.0, 1.0], while normal scene rendering occurs within a depth range of [0.1, 1.0]. The ledge could be built into the view frustum
by manipulating the perspective projection matrix. Once in place, the near clip plane capping of shadow volumes is done at a depth value of 0.05, which is half of the ledge. This idea is indeed original, but it does not totally solve the problem. Cracks or holes in the near plane shadow capping occur very frequently, producing erroneous results. The conclusion with the near clip plane problem is that there are really no trivial solutions. At least, there is no known foolproof solution to the problem at the time of publication. This makes the depth-pass technique less robust and confines its spectrum of application to those situations where near plane clipping of the shadow volume is not possible (e.g., real-time strategy (RTS) games).
Fortunately, there is an elegant solution to the far plane clipping problem that plagues the depth-fail technique. The antidote to the problem is an infinite perspective view projection, or simply an infinite view frustum. By projecting the far plane all the way to infinity, there is no mathematical chance of the shadow volume being clipped by the far plane. Even if the shadow volume were extruded to infinity, the far plane at infinity would still not clip it after some projection matrix alteration. The derivation for a left-handed Direct3D perspective projection matrix is presented here. For the derivation of such a matrix applicable to OpenGL, please refer to Eric Lengyel [8]. Let's start by looking at a standard left-handed perspective projection matrix in Direct3D:
$$P = \begin{bmatrix} \cot\frac{fov_w}{2} & 0 & 0 & 0 \\ 0 & \cot\frac{fov_h}{2} & 0 & 0 \\ 0 & 0 & \frac{f}{f-n} & 1 \\ 0 & 0 & -\frac{fn}{f-n} & 0 \end{bmatrix} \tag{1}$$
Variables:
n: near plane distance
f: far plane distance
fov_w: horizontal field of view in radians
fov_h: vertical field of view in radians
A far plane at infinity means that the far plane distance needs to approach $\infty$. Hence, we get the following perspective projection matrix when the far plane distance goes toward the infinity limit:
$$P_\infty = \lim_{f \to \infty} P = \begin{bmatrix} \cot\frac{fov_w}{2} & 0 & 0 & 0 \\ 0 & \cot\frac{fov_h}{2} & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & -n & 0 \end{bmatrix} \tag{2}$$
Equation (2) defines a perspective projection view that extends from the near plane to a far plane at infinity. But are we absolutely sure that the vertices that we extruded to infinity using the 4D homogeneous vector do not get clipped at infinity? Sadly, we cannot be 100 percent sure of this due to limited hardware precision. In reality, graphics hardware sometimes produces points with a normalized z-coordinate marginally greater than 1.0, which happens to be the limit at the far plane. These values are then converted into integers for use in the depth buffer. This is going to wreak havoc, since our stencil operations depend wholly on the depth value testing. (As a side note, the DirectX 9.0 Direct3D API features a floating-point z-buffer format, which may alleviate this situation. However, it is applicable only to hardware that supports floating-point depth buffers.) Fortunately, there is a workaround for this problem. The solution is to map the z-coordinate values of our normalized device coordinates from a range of [0, 1] to [0, 1 - e], where e is a small positive constant. This means that we are trying to map the z-coordinate of a point at infinity to a value that is slightly less than 1.0 in normalized device coordinates. (In OpenGL, normalized device z-coordinates range from -1.0 to 1.0.) Let $D_z$ be the original z-coordinate value and $D'_z$ be the mapped z-coordinate. The mapping can be achieved using the equation shown below:
$$D'_z = D_z (1 - e) \tag{3}$$
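Now, let's make use of equation (2) to transform a point A from camera space ($A_{cam}$) to clip space ($A_{clip}$). Note that camera space is also commonly referred to as eye space.
$$A_{clip} = A_{cam} P_\infty = \begin{bmatrix} A_x & A_y & A_z & A_w \end{bmatrix} \begin{bmatrix} \cot\frac{fov_w}{2} & 0 & 0 & 0 \\ 0 & \cot\frac{fov_h}{2} & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & -n & 0 \end{bmatrix}$$
$$A_{clip} = \begin{bmatrix} A_x \cot\frac{fov_w}{2} & A_y \cot\frac{fov_h}{2} & A_z - nA_w & A_z \end{bmatrix} \tag{4}$$
Let's factor the desired range mapping into equation (3) by replacing $D_z$ with $\frac{(A_{clip})_z}{(A_{clip})_w}$ and $D'_z$ with $\frac{(A'_{clip})_z}{(A_{clip})_w}$:
$$\frac{(A'_{clip})_z}{(A_{clip})_w} = \frac{(A_{clip})_z}{(A_{clip})_w} (1 - e) \tag{5}$$
Simplifying equation (5) by using the values given by equation (4), we get:
$$(A'_{clip})_z = A_z (1 - e) + nA_w (e - 1) \tag{6}$$
Using equation (6), we can enforce our range mapping into the projection matrix $P_\infty$ given by equation (2) to get the following:
$$P'_\infty = \begin{bmatrix} \cot\frac{fov_w}{2} & 0 & 0 & 0 \\ 0 & \cot\frac{fov_h}{2} & 0 & 0 \\ 0 & 0 & 1 - e & 1 \\ 0 & 0 & n(e - 1) & 0 \end{bmatrix} \tag{7}$$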
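The range-mapped infinite projection can be checked numerically. The sketch below builds the matrix in plain C++ using the row-vector convention of the derivation (a point is multiplied on the left of the matrix) and evaluates the normalized depth of a few points. The names `InfiniteProjectionLH`, `Transform`, and `NdcDepth` are hypothetical helpers introduced for illustration; they are not part of Direct3D or D3DX.

```cpp
#include <array>
#include <cmath>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;

// Left-handed perspective projection with the far plane at infinity and the
// depth range mapped to [0, 1 - e]. Row-vector convention: clip = v * M.
Mat4 InfiniteProjectionLH(double fovW, double fovH, double n, double e)
{
    Mat4 m{};                                 // all elements start at zero
    m[0][0] = 1.0 / std::tan(fovW * 0.5);     // cot(fov_w / 2)
    m[1][1] = 1.0 / std::tan(fovH * 0.5);     // cot(fov_h / 2)
    m[2][2] = 1.0 - e;
    m[2][3] = 1.0;
    m[3][2] = n * (e - 1.0);
    return m;
}

// Multiplies the row vector v by the matrix m.
Vec4 Transform(const Vec4& v, const Mat4& m)
{
    Vec4 r{};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            r[col] += v[row] * m[row][col];
    return r;
}

// Normalized device depth z / w (assumes the clip-space w is not zero).
double NdcDepth(const Vec4& clip)
{
    return clip[2] / clip[3];
}
```

A direction vector such as (0, 0, 1, 0), which stands for a point at infinity, comes out with a depth of exactly 1 - e, so it stays just inside the far limit. A point on the near plane maps to a depth of 0.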
Thus, we can use the perspective projection matrix given in equation (7) without fear of far plane clipping of shadow volumes occurring at infinity! You might wonder whether stretching the view frustum volume all the way to infinity would impact depth buffer precision. The answer is yes, it does affect precision, but the loss of precision is really negligible. The amount of numerical range lost when extending the far plane out to infinity is only $n/f$. Say our original near clip plane is at 0.1 meters, and the far clip plane is at 100 meters. This range corresponds to a depth range of [0, 1.0]. We then extend the far plane distance to infinity. The range from 0.1 meters to 100 meters would now correspond to a depth range of [0, 0.999]. The range from 100 meters to infinity would correspond to a depth range of [0.999, 1.0]. The loss in depth buffer precision is therefore small. The larger the difference between the n and f values, the smaller the loss in depth buffer precision. You can find the above derivations and many other related mathematical derivations in Eric Lengyel's book [9]. It should be noted that using an infinite view frustum means that we have to draw more geometries. This may pose a potential performance problem. The infinite view frustum projection is really just a software solution to the far plane clipping problem. Mark Kilgard and Cass Everitt [10] presented a hardware solution to the problem instead of using an infinite view frustum. Newer graphics hardware now supports a rendering technique called depth clamping. In fact, the depth-clamping extension, NV_depth_clamp, was specifically added to nVidia's GeForce3 and above graphics cards to solve the far plane clipping problem for shadow volume implementations.
When active, depth clamping forces all the objects beyond the far clip plane to be drawn at the far clip plane with the maximum depth value. This means that we can project the closed shadow volume to any arbitrary distance without fear of it being clipped by the far plane, as the hardware will handle the drawing properly. With such automatic support from graphics hardware, depth-fail shadow volumes become easier to implement. We can extend the shadow volume to infinity while rendering with our finite view frustum and still get correct depth-fail stencil values! However, the trade-off is hardware dependence, unless hardware vendors and graphics APIs such as Direct3D and OpenGL commonly support depth clamping in the future. If we want the depth-fail shadow volume to work for any graphics card (with stenciling support at least), we have to use the infinite view frustum projection instead of the depth-clamping extension. With a good background on the stencil shadow volume algorithms and their associated problems, it is time to plunge into their implementations. The following sections present two different approaches to implementing stencil shadow volumes. The first approach is the common way of determining the occluder's silhouette on the CPU and uploading the new shadow volume vertices onto the hardware. The second approach makes use of the programmable pipeline (vertex shader) to construct the shadow volume on the hardware itself, thereby saving the cost of uploading new geometries every frame. It should be noted here that a one-off preprocessing of the occluder's geometries is necessary for the vertex shader (GPU) implementation. The preprocessing adds new vertices into the source data set in order to facilitate the construction of the shadow volume on the hardware. With optimization in place, preprocessed data sets typically contain around two times more vertices.
Implementation on CPU
For this section on CPU implementation, the reader should refer to both the DepthPassCPU and DepthFailCPU samples, which can be found on the companion CD. Note that both samples are based on DirectX 8.1. A list of general steps for implementing shadow volumes on the CPU is presented shortly. The subsequent discussion of the two CPU-based samples will closely follow these steps.
How It Is Done
Let's collate what we have learned and try to come up with the steps to do both depth-pass and depth-fail stencil shadow volumes on the CPU. A general list of steps to implement stencil shadow volumes is:
1. Render the scene to fill the depth buffer with the correct z values.
2. Select a light source. Clear the stencil buffer if this is the first light. Calculate the silhouette of all the occluders with respect to the light source.
3. Extrude the silhouette away from the light source to a finite or infinite distance to form the shadow volumes, and generate the capping if the depth-fail technique is used.
4. Set up the stencil operations and render the shadow volumes using the depth-pass or depth-fail technique.
5. Repeat steps 2 to 4 for all selected lights in the scene.
6. Using the updated stencil buffer, do a lighting pass to shade (or make a tone darker) the pixels that correspond to non-zero stencil values.
The above list of steps is just one way to achieve a shadowed scene using the stencil buffer values. Many other workable approaches to creating a shadowed scene exist. For example, per-pixel attenuation techniques can be combined with the stencil shadow volume algorithm so that instead of darkening the pixels in shadows, the pixels are not drawn at all. We go through the
Silhouette Determination
As described in step 2 of the previous section, once a light source is selected, the first step to constructing a shadow volume is to determine the silhouette of the occluder. The stencil shadow algorithm requires that the occluders be made up of closed triangle meshes. This means that every edge in the model must be shared by exactly two triangles, thus disallowing any holes that would expose the interior of the model. The reasons for this requirement are obvious, as any seams or holes (formed for example by T-junctions) would greatly complicate the silhouette determination algorithm. In cases where the original occluder's geometries are used for forming the front capping, non-closed meshes will also throw the stencil counting off-balance. There are ways to circumvent a few special cases of non-closed triangle meshes for use in shadow volume implementations, but these are beyond the scope of this article. In silhouette calculations, we are only interested in the edges shared by a triangle that faces the light source and another triangle that faces away from the light source. Let's assume that we are working with an indexed triangle mesh.
Figure 14 shows one side of a box that is made up of four triangles with consistent clockwise winding. The broken lines indicate the redundant internal edges, since we are only interested in the solid line that forms the outline of the box. The redundant internal edges are indexed twice, as they are shared by two triangles. We can take advantage of this property to come up with a simple method to determine the silhouette edges.
1. Loop through all the model's triangles.
2. If a triangle faces the light source (dot product of the light's direction vector and the triangle face normal is greater than zero):
a. Insert the three edges (pairs of vertices) of the triangle into an edge stack.
b. Check for a previous occurrence of each edge or its reverse in the stack.
c. If an edge or its reverse is found in the stack, remove both edges.
3. Edges left in the stack form the silhouette.
The above algorithm ensures that all the internal edges will eventually be removed from the stack, since they are indexed by more than one triangle. This silhouette determination method is implemented in both the DepthPassCPU and DepthFailCPU samples, as the function InsertEdge() called from BuildShadowVolume(). The following code snippet is taken from the BuildShadowVolume() function in the DepthPassCPU sample.
01  MESHVERTEX* pVertices;
02  WORD* pIndices;
03
04  // Lock the geometry buffers
05  pMesh->LockVertexBuffer( 0L, (BYTE**)&pVertices );
06  pMesh->LockIndexBuffer( 0L, (BYTE**)&pIndices );
07  DWORD dwNumVertices = pMesh->GetNumVertices();
08  DWORD dwNumFaces = pMesh->GetNumFaces();
09
10  // Allocate a temporary edge list
11  WORD* pEdges = new WORD[dwNumFaces*6];
12  DWORD dwNumEdges = 0;
13
14  // For each face, check all 3 edges
15  for( DWORD i=0; i<dwNumFaces; i++ )
16  {
17      WORD wIndex0 = pIndices[3*i+0];
18      WORD wIndex1 = pIndices[3*i+1];
19      WORD wIndex2 = pIndices[3*i+2];
20      D3DXVECTOR3 v0 = pVertices[wIndex0].p;
21      D3DXVECTOR3 v1 = pVertices[wIndex1].p;
22      D3DXVECTOR3 v2 = pVertices[wIndex2].p;
23
24      // Note that vLight has already been transformed to object space. This saves some computation work
25      // Cosine value larger than 0.0 means light-facing since angle between
26      // light vector vLight and the face normal is within -90 to 90 degrees
27
28      // Face normal is computed in order to use welded models
29      D3DXVECTOR3 vCrossValue1(v2-v1);
30      D3DXVECTOR3 vCrossValue2(v1-v0);
31      D3DXVECTOR3 vFaceNormal;
32      D3DXVec3Cross( &vFaceNormal, &vCrossValue1, &vCrossValue2 );
33
34      // Take note that we are doing a recalculation of vLightDir, or direction
35      // vector of incoming light ray by using the first vertex of a face (3 vertices) to represent that face.
36      // The dot product test is also only done once per face.
37      D3DXVECTOR3 vLightDir = vLight - v0; // Direction vector of incoming light rays
38      if( D3DXVec3Dot( &vFaceNormal, &vLightDir ) >= 0.0f )
39      {
40          InsertEdge( pEdges, dwNumEdges, wIndex0, wIndex1 );
41          InsertEdge( pEdges, dwNumEdges, wIndex1, wIndex2 );
42          InsertEdge( pEdges, dwNumEdges, wIndex2, wIndex0 );
43      }
44  }
45
Note that we have to compute the face normal for every face in the code from line 29 through 32. The calculation of face normals coupled with the use of indices instead of positions for comparison later in the InsertEdge() function will allow us to make use of welded models. Welded models result in better performance due
to reduced polygon counts for the shadow volume generated. We discuss the advantages of welded models in the Efficiency and Robustness section. The above code will also work for non-welded models. Both the DepthPassCPU and DepthFailCPU samples make use of welded models. On line 37, the vector vLightDir is calculated from the light position and the first vertex of the current face. Hence, we do only one dot product test for each face, as we are using the face normal, which is the same for all three vertices. If the dot product test at line 38 finds the face to be light facing, all three edges of the face are inserted into an edge stack through the InsertEdge() function. The following is the code for the InsertEdge() function:
VOID CShadow::InsertEdge( WORD* pEdges, DWORD& dwNumEdges, WORD v0, WORD v1 )
{
    for (DWORD i=0; i < dwNumEdges; i++)
    {
        if( ( pEdges[2*i+0] == v0 && pEdges[2*i+1] == v1 )||( pEdges[2*i+0] == v1 && pEdges[2*i+1] == v0 ) )
        {
            if( dwNumEdges > 1 )
            {
                pEdges[2*i+0] = pEdges[2*( dwNumEdges-1 )+0];
                pEdges[2*i+1] = pEdges[2*( dwNumEdges-1 )+1];
            }
            dwNumEdges--;
            return;
        }
    }

    pEdges[2*dwNumEdges+0] = v0;
    pEdges[2*dwNumEdges+1] = v1;
    dwNumEdges++;
}
The InsertEdge() function tests for recurrences of new edges and eliminates those that are duplicated. After running through the entire model, the edges left over in the stack represent the silhouette edges that we need.
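The edge-cancellation idea can also be tried in isolation, without Direct3D types. The following is a minimal sketch in standard C++, not the sample code: `Edge` and `SilhouetteOfLitFaces` are hypothetical names, a `std::vector` stands in for the raw edge array, and the index list passed in is assumed to contain only triangles that have already passed the light-facing test.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Edge = std::pair<std::uint16_t, std::uint16_t>;

// Adds the edge (v0, v1) unless it or its reverse is already stored, in which
// case the stored copy is removed instead, so shared edges cancel out.
void InsertEdge(std::vector<Edge>& edges, std::uint16_t v0, std::uint16_t v1)
{
    for (std::size_t i = 0; i < edges.size(); ++i) {
        const bool same     = edges[i].first == v0 && edges[i].second == v1;
        const bool reversed = edges[i].first == v1 && edges[i].second == v0;
        if (same || reversed) {
            edges[i] = edges.back();   // overwrite with the last edge
            edges.pop_back();          // then shrink the list by one
            return;
        }
    }
    edges.emplace_back(v0, v1);
}

// indices holds three vertex indices per light-facing triangle (an assumption
// of this sketch; the facing test itself is omitted here).
std::vector<Edge> SilhouetteOfLitFaces(const std::vector<std::uint16_t>& indices)
{
    std::vector<Edge> edges;
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        InsertEdge(edges, indices[i + 0], indices[i + 1]);
        InsertEdge(edges, indices[i + 1], indices[i + 2]);
        InsertEdge(edges, indices[i + 2], indices[i + 0]);
    }
    return edges;
}
```

For two triangles (0, 1, 2) and (0, 2, 3) that share the diagonal between vertices 0 and 2, the diagonal cancels and the four outline edges remain.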
Eric Lengyel [8] presented another silhouette determination algorithm that makes use of the consistent winding (counterclockwise for OpenGL) of vertices. The method requires two passes on all the triangles of the model to filter in all the edges shared by pairs of triangles. The resultant edge list then undergoes the dot product operations to get the edges that are shared by a light-facing triangle and a non-light-facing triangle. It is important to note that silhouette determination is one of the two most expensive operations in stencil shadow volume implementation. The other is the shadow volume rendering passes to update the stencil buffer. These two areas are prime candidates for aggressive optimizations, which we discuss in detail in the concluding sections. Now let's get on to the business of forming the shadow volume using the silhouette edges that we have obtained.
Figure 15: Extrusion and the forming of shadow volume for a point light source
As shown in Figure 15 above, the silhouette edge defined by vertices v1 and v2 is used to create two more vertices, v3 and v4. The four vertices are then used to create a quad to form the side of the shadow volume. The arrows within the quad show the clockwise ordering of the vertices that is needed to make the side face outward. This is implemented in the function BuildShadowVolume() for both the DepthPassCPU and DepthFailCPU samples. With regard to the distance needed to extrude vertices v1 and v2 to form v3 and v4, both the DepthPassCPU and DepthFailCPU samples employ a finite extrusion distance. We discuss infinite shadow volume extrusion shortly. In the Finite Shadow Cover and Ghost Shadow sections, we discussed the two scenarios whereby infinite or finite shadow volume extrusion might be desirable for different reasons. The implementation for finite extrusion is trivial. Referring to Figure 15 again, a light vector is formed by making use of the light position and the selected vertex. The light vector defines the direction
vector of the incoming light ray at that vertex. The extruded vertex can then be computed by extending the selected vertex by a finite distance in the direction of the light vector. Take note that it is not advisable to extrude the vertex by a multiple of the magnitude of the light vector. This is because the light vector is different for every vertex (assuming point light sources), and its magnitude can differ wildly. If the magnitude of the light vector is too small (e.g., the light is very close to the vertex), the vertex may not be extruded far enough to provide adequate shadow cover. Hence, extruding the vertices by an absolute distance is recommended. This is easily done by normalizing the light vector and multiplying its individual components by the absolute distance to be extruded. Lastly, we insert two triangles using the original and extruded vertices to form the sides of the shadow volume. The following code snippet from the BuildShadowVolume() function in the DepthPassCPU sample accomplishes what we have just discussed:
    // For each silhouette edge, duplicate it,
    for( i=0; i<dwNumEdges; i++ )
    {
        D3DXVECTOR3 v1 = pVertices[pEdges[2*i+0]].p;
        D3DXVECTOR3 v2 = pVertices[pEdges[2*i+1]].p;

        D3DXVECTOR3 v3;
        D3DXVECTOR3 v4;
        D3DXVECTOR3 vExtrusionDir; // Direction vector from light to vertex, or rather the direction to extrude

        // Extrusion can be tricky. It is not advisable to extrude vertices by a multiple of the magnitude of
        // vExtrusionDir. This is because the magnitude may be so small that even a large multiple would not extrude
        // the vertices far enough. Results will be unpredictable if either light source or occluders are dynamic
        // objects. Hence, we normalize the vExtrusionDir vector before multiplying by the ABSOLUTE distance
        // that we want to extrude the vertex to.

        vExtrusionDir = v1 - vLight; // Compute a new extrusion direction for new vertex
        D3DXVec3Normalize( &vExtrusionDir, &vExtrusionDir );

        v3.x = v1.x + vExtrusionDir.x * g_fExt;
        v3.y = v1.y + vExtrusionDir.y * g_fExt;
        v3.z = v1.z + vExtrusionDir.z * g_fExt;

        vExtrusionDir = v2 - vLight; // Compute a new extrusion direction for new vertex
        D3DXVec3Normalize( &vExtrusionDir, &vExtrusionDir );

        v4.x = v2.x + vExtrusionDir.x * g_fExt;
        v4.y = v2.y + vExtrusionDir.y * g_fExt;
        v4.z = v2.z + vExtrusionDir.z * g_fExt;

        // Add a quad (two triangles) to the vertex list
        m_pVertices[m_dwNumOfVertices++] = v1;
        m_pVertices[m_dwNumOfVertices++] = v4;
        m_pVertices[m_dwNumOfVertices++] = v2;
        // ... (the second triangle of the quad is added in the same way)
    }
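To see why an absolute distance is preferred, the following stand-alone sketch compares the two approaches. It does not use D3DX; the Vec3 type and the helper names ExtrudeByDistance() and ExtrudeByMultiple() are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Hypothetical helper (not part of the samples): move `v` away from a point
// light by an absolute distance, as the text recommends.
Vec3 ExtrudeByDistance(const Vec3& v, const Vec3& light, float distance)
{
    const Vec3 dir = { v.x - light.x, v.y - light.y, v.z - light.z };
    const float len = std::sqrt(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
    assert(len > 0.0f);             // the light must not sit exactly on the vertex
    const float s = distance / len; // normalize, then scale by the distance
    return { v.x + dir.x * s, v.y + dir.y * s, v.z + dir.z * s };
}

// For comparison only: extrude by a multiple of the unnormalized light vector.
Vec3 ExtrudeByMultiple(const Vec3& v, const Vec3& light, float multiple)
{
    return { v.x + (v.x - light.x) * multiple,
             v.y + (v.y - light.y) * multiple,
             v.z + (v.z - light.z) * multiple };
}
```

With a light at the origin and a vertex only 0.01 units away, ExtrudeByDistance() with a distance of 10 still moves the vertex 10 units, whereas ExtrudeByMultiple() with a multiple of 10 moves it only 0.1 units.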
As discussed previously, we may need to extrude the silhouette edges to infinity to avoid the situation shown in Figure 10, where a finite shadow volume extrusion fails to cover all the shadow receivers in a scene. However, it is not compulsory to extrude the silhouette edges to infinity if we can ensure that the situation in Figure 10 will never happen in our scene. In practical cases, a large value would normally be more than adequate. Mark Kilgard [7] introduced the trick of using the w value of homogeneous coordinates to render semi-infinite vertices. In a homogeneous coordinate system, we represent a point or vector as (x, y, z, w), with w being the fourth coordinate. For points, w is equal to 1.0. For vectors, w is equal to 0.0. The homogeneous notation is extremely useful for transforming both points and vectors. Since translation is only meaningful for points and not vectors, the value of w ensures that translation is applied only to points. This can be easily deduced, since the translation values of a transformation matrix are on either the fourth column or the fourth row, depending on the matrix convention, and are therefore multiplied by w. By setting the w value of the infinity-bound vertices to 0.0,
we change the homogeneous representation from that of a 3D point to a 3D vector. The rendering of a vector (w = 0.0) in clip space would be semi-infinite. It is important to note that we should only set the w values to 0.0 before transformation to clip space. Technically, this implies that we want to render the vertex as a 3D vector of the form (x, y, z, 0). Rendering such a vertex is possible in Direct3D's fixed-function pipeline by using the flexible vertex format D3DFVF_XYZRHW. This is because when we set the flexible vertex format to D3DFVF_XYZRHW, we are bypassing Direct3D's transformation and lighting pipeline. Our program becomes responsible for transforming and lighting the vertices, as Direct3D would merely pass the input to the hardware for rasterization. From the DirectX 8.1 documentation:
If you include the D3DFVF_XYZRHW flag in your vertex format description, you are telling the system that your application is using transformed and lit vertices. This means that Microsoft Direct3D doesn't transform your vertices with the world, view, or projection matrices, nor does it perform any lighting calculations. It assumes that your application has taken care of these steps. This fact makes transformed and lit vertices common when porting existing 3D applications to Direct3D. In short, Direct3D does not modify transformed and lit vertices at all. It passes them directly to the driver to be rasterized. The system requires that the vertex position that you specify be already transformed. The x and y values must be in screen coordinates, and z must be the depth value of the pixel to be used in the z-buffer. Z values can range from 0.0 to 1.0, where 0.0 is the closest possible position to the user and 1.0 is the farthest position still visible within the viewing area. Immediately following the position, transformed and lit vertices must include a reciprocal of homogeneous W (RHW) value. RHW is the reciprocal of the W coordinate from the homogeneous point (x,y,z,w) at which the vertex exists in projection space. This value often works out to be the distance from the eyepoint to the vertex, taken along the z-axis.
From the above explanation, RHW means the reciprocal of the homogeneous w value (1/w), which is the result of normalizing the w component. However, we cannot explicitly represent our vertices as infinite. We can get around this by setting the w component of the vertex to 0.0 before applying the clip space transformation (world*view*projection). Ordinarily, the homogenization process during perspective projection would divide all four components by the w component in order to normalize the w component to 1.0 (Moller and Haines [11]). However, by using the D3DFVF_XYZRHW vertex format, we must implement the homogenization process ourselves in order to complete the transformation to clip space. Next, we need to manually map the x and y values from clip space to screen coordinates. A rectangular clipping volume defines the clip space with an x-coordinate range of [-1.0...1.0] and a y range of [-1.0...1.0]. We need to map to screen coordinates that range from (0.0, 0.0) at the top-left corner to (screen horizontal resolution, screen vertical resolution) at the bottom-right corner. The screen coordinates can finally be passed on to Direct3D for rasterization. As described above, rendering vectors (w=0) using the fixed-function pipeline can be both error-prone and inefficient. A lot of geometry transformation and mapping needs to be done, and the computation load shifts toward the CPU while the graphics hardware's powerful arithmetic processors lie idle. Ideally, the infinite extrusion of geometries can be done more easily in a vertex program, since we are already transforming the vertices in a vertex shader. In fact, this is done in the InfiniteGPU sample using a simple vertex program that is discussed in the Implementation on GPU (Shaders) section. Note that we do not need to light the vertices, since surface color values have no meaning for the shadow volume polygons.
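The manual homogenization and screen mapping described above can be sketched as follows. This is a minimal illustration rather than code from the samples: the Vec4 and ScreenVertex types and the ClipToScreen() function are hypothetical, the viewport is assumed to cover the whole screen, and clipping and any pixel-center offset are ignored.

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };            // position after world*view*projection
struct ScreenVertex { float x, y, z, rhw; };  // layout expected by D3DFVF_XYZRHW

// Hypothetical helper: divide by w, then map x and y from [-1, 1] to pixels.
ScreenVertex ClipToScreen(const Vec4& clip, float screenWidth, float screenHeight)
{
    assert(clip.w != 0.0f);          // the transformed w must be non-zero
    const float rhw = 1.0f / clip.w; // reciprocal homogeneous w
    const float ndcX = clip.x * rhw; // homogenization
    const float ndcY = clip.y * rhw;
    const float ndcZ = clip.z * rhw;

    ScreenVertex out;
    out.x = (ndcX + 1.0f) * 0.5f * screenWidth;   // -1 maps to the left edge
    out.y = (1.0f - ndcY) * 0.5f * screenHeight;  // +1 maps to the top edge
    out.z = ndcZ;
    out.rhw = rhw;
    return out;
}
```

Note that the y-axis is flipped during the mapping because screen coordinates grow downward from the top-left corner.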
The w-coordinate demo at nVidia [28] is a simple program for visualizing the rendering of vertices with different w-coordinate values.
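The effect of w on translation can also be checked with a few lines of code. The sketch below assumes the row-vector convention, in which the translation values occupy the fourth row of the matrix; the Vec4 and Mat4 types and the MakeTranslation() and Transform() helpers are hypothetical and written only for this illustration.

```cpp
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };

// Hypothetical helper: translation matrix with the offsets in the fourth row.
Mat4 MakeTranslation(float tx, float ty, float tz)
{
    Mat4 t = {{ { 1.0f, 0.0f, 0.0f, 0.0f },
                { 0.0f, 1.0f, 0.0f, 0.0f },
                { 0.0f, 0.0f, 1.0f, 0.0f },
                { tx,   ty,   tz,   1.0f } }};
    return t;
}

// Row vector times matrix: the fourth row is scaled by w before being added.
Vec4 Transform(const Vec4& v, const Mat4& M)
{
    Vec4 r;
    r.x = v.x * M.m[0][0] + v.y * M.m[1][0] + v.z * M.m[2][0] + v.w * M.m[3][0];
    r.y = v.x * M.m[0][1] + v.y * M.m[1][1] + v.z * M.m[2][1] + v.w * M.m[3][1];
    r.z = v.x * M.m[0][2] + v.y * M.m[1][2] + v.z * M.m[2][2] + v.w * M.m[3][2];
    r.w = v.x * M.m[0][3] + v.y * M.m[1][3] + v.z * M.m[2][3] + v.w * M.m[3][3];
    return r;
}
```

Transforming (1, 2, 3, 1) by a translation of (10, 20, 30) gives (11, 22, 33, 1), while (1, 2, 3, 0) is returned unchanged, because the translation row is multiplied by w = 0.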
Figure 16: Front and back capping to create closed shadow volumes
Figure 16 shows two sets of images employing different geometries to close the shadow volume. The first row depicts a closed shadow volume formed by a front and back capping that reuses the occluder's light-facing geometries. The second row shows a closed shadow volume with a front capping that reuses light-facing geometries of the occluder and a triangle-fan back capping constructed from extruded silhouette edges. The triangle-fan back capping results in less geometry and hence requires less memory and rendering resources. While optimizing the back capping with a triangle-fan is trivial, the same cannot be said for the front capping. This is due to the fact that the occluder's self-shadowing totally depends on the accuracy of the front capping. To be precise, the most accurate front capping is one that is created from the actual front-facing geometries of the occluder. Such a front capping would ensure that any grooves or knobs on the occluder's surface would be correctly
self-shadowed. When the occluder is too small for any self-shadowing to be noticeable (Diablo and RTS-style games) or when the light-facing side of a static occluder is generally flat, we can make use of triangle strips formed using the silhouette edges to cut down on the amount of front-capping geometry.
25      m_pd3dDevice->SetRenderState( D3DRS_STENCILPASS, D3DSTENCILOP_INCR );
26
27      // Show shadow volume front faces?
28      if ( m_bShowShadowVolFrontFace )
29      {
30          m_pd3dDevice->SetMaterial( &m_ShadowVolFrontFaceMaterial );
31          m_pd3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, TRUE );
32          m_pd3dDevice->SetRenderState( D3DRS_SRCBLEND, D3DBLEND_SRCCOLOR );
33          m_pd3dDevice->SetRenderState( D3DRS_DESTBLEND, D3DBLEND_DESTALPHA );
34      }
35      else
36          m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x00000000 );
37
38      // Draw front side of shadow volume in stencil/z only
39      m_pd3dDevice->SetTransform( D3DTS_WORLD, &m_matObject);
40      m_pShadow->RenderShadowVolume( m_pd3dDevice );
41      m_pd3dDevice->SetTransform( D3DTS_WORLD, &m_matObject2);
42      m_pShadow2->RenderShadowVolume( m_pd3dDevice );
43
44      // Now reverse cull order so back sides of shadow volume are written.
45      m_pd3dDevice->SetRenderState( D3DRS_CULLMODE, D3DCULL_CW );
46
47      // Decrement stencil buffer value if depth test passes
48      m_pd3dDevice->SetRenderState( D3DRS_STENCILPASS, D3DSTENCILOP_DECR );
49
50      // Show shadow volume back faces?
51      if ( m_bShowShadowVolBackFace )
52      {
53          m_pd3dDevice->SetMaterial( &m_ShadowVolBackFaceMaterial );
54          m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x0000000F );
55          m_pd3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, TRUE );
56          m_pd3dDevice->SetRenderState( D3DRS_SRCBLEND, D3DBLEND_SRCCOLOR );
57          m_pd3dDevice->SetRenderState( D3DRS_DESTBLEND, D3DBLEND_DESTALPHA );
58      }
59      else
60          m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x00000000 );
61
62      // Draw back side of shadow volume in stencil/z only
63      m_pd3dDevice->SetTransform( D3DTS_WORLD, &m_matObject);
64      m_pShadow->RenderShadowVolume( m_pd3dDevice );
65      m_pd3dDevice->SetTransform( D3DTS_WORLD, &m_matObject2);
66      m_pShadow2->RenderShadowVolume( m_pd3dDevice );
67
68      // Restore render states
69      m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x0000000F );
70      m_pd3dDevice->SetRenderState( D3DRS_SHADEMODE, D3DSHADE_GOURAUD );
71      m_pd3dDevice->SetRenderState( D3DRS_CULLMODE, D3DCULL_CCW );
72      m_pd3dDevice->SetRenderState( D3DRS_ZWRITEENABLE, TRUE );
73      m_pd3dDevice->SetRenderState( D3DRS_STENCILENABLE, FALSE );
74      m_pd3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, FALSE );
75
76      return S_OK;
77  }
Note that prior to the calling of the RenderShadowVolume() function, the depth buffer had already been filled with the appropriate depth values during the rendering pass of step 1, as discussed in the How It Is Done section. Lines 5 and 6 disable writing to the depth buffer and enable stencil testing. The code within lines 15 to 25 sets up the stencil operations prior to rendering the shadow volume. Line 15 forces the stencil test to always pass, while lines 16 and 17 instruct Direct3D to retain the stencil values in case of depth fail or stencil test fail. Line 20 sets the stencil reference value to 1. Lines 21 and 22 set the stencil comparison mask and write mask to include every bit. The following is the complete test function employed by the Direct3D API during stencil tests:
(StencilRef & StencilMask) CompFunc (StencilBufferValue & StencilMask)
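The test function above can be written out as a small stand-alone routine. This is only a sketch of the comparison, not Direct3D code: the CompareFunc enumeration is a hypothetical stand-in for the D3DCMP_* values, and StencilTestPasses() is a name invented for this illustration.

```cpp
#include <cstdint>

// Hypothetical stand-in for the D3DCMP_* comparison functions.
enum class CompareFunc { Never, Less, Equal, LessEqual, Greater, NotEqual, GreaterEqual, Always };

// Evaluates (StencilRef & StencilMask) CompFunc (StencilBufferValue & StencilMask).
bool StencilTestPasses(std::uint32_t stencilRef, std::uint32_t stencilMask,
                       CompareFunc func, std::uint32_t bufferValue)
{
    const std::uint32_t lhs = stencilRef & stencilMask;
    const std::uint32_t rhs = bufferValue & stencilMask;
    switch (func) {
    case CompareFunc::Never:        return false;
    case CompareFunc::Less:         return lhs <  rhs;
    case CompareFunc::Equal:        return lhs == rhs;
    case CompareFunc::LessEqual:    return lhs <= rhs;
    case CompareFunc::Greater:      return lhs >  rhs;
    case CompareFunc::NotEqual:     return lhs != rhs;
    case CompareFunc::GreaterEqual: return lhs >= rhs;
    case CompareFunc::Always:       return true;
    }
    return false; // not reached for valid enumerators
}
```

Note that the reference value is on the left-hand side of the comparison, so a "less than or equal" function with a reference of 1 passes for buffer values of 1 or more.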
For more information on the other uses of stencil buffers, please refer to [3]. Line 25 tells Direct3D to increment the stencil value if both stencil and depth tests pass. The stencil test has already been set to always pass in line 15, so it is really only the depth test in question here. Lines 28 to 36 either disable the color writes to frame buffer or set up alpha blending to reveal the front faces of the shadow volumes. Next, we proceed to render the shadow volumes of our occluders in lines 39 through 42. This is in agreement with the first step of the depth-pass algorithm presented in the Depth-pass (z-pass) section. Following the second step of the depth-pass algorithm, line 45 reverses the culling mode so that we can start drawing the back
faces of the shadow volume. Line 48 sets the stencil operation to decrement the stencil values if the stencil and depth tests pass. Again, the stencil test always passes, and it is only the depth test that we are really testing against. Lines 51 to 60 either disable the color writes to frame buffer or set up alpha blending to reveal the back faces of the shadow volumes. Lines 63 to 66 draw the shadow volumes again with the culling reversed. Lines 69 through 74 restore the render states to their original settings. That completes the rendering of the shadow volumes for the depth-pass algorithm in the DepthPassCPU sample. We should note that the order in which we perform the increment and decrement passes is really inconsequential. This is because at lines 25 and 48, we set the stencil increment and decrement operations as wrapping, which has been available since DirectX 6. Thus, we can start with either incrementing or decrementing the stencil values. The stencil buffer can only contain values from 0 to 2^n - 1, where n is the stencil bit depth. When the maximum stencil value is reached, the stencil value is wrapped to 0 with the next increment operation. Similarly, when the minimum stencil value of 0 is reached, the next decrement operation wraps the stencil to 2^n - 1. This ensures that the shadow volume counting will not be thrown off balance due to saturation at the maximum or minimum stencil value, so that we leave behind non-zero stencil values for pixels with unbalanced shadow volume entry and exit counts. It also means that the bit depth of the stencil buffer is not important to us, as a 2-bit stencil buffer (if one exists) will work as well as an 8-bit stencil buffer. If we opt for stencil value clamping (e.g., setting D3DRS_STENCILPASS to D3DSTENCILOP_INCRSAT to clamp to the maximum value), we will lose track of the correct shadow volume count if the stencil value gets saturated at 2^n - 1, and the stencil values will be incorrect.
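The difference between wrapping and clamping can be demonstrated with a small simulation of a single stencil value. The sketch below is an illustration only; the function names are hypothetical, and the stencil value is modeled as an n-bit unsigned integer.

```cpp
#include <cassert>
#include <cstdint>

// Largest value an n-bit stencil buffer can hold, i.e. 2^n - 1.
std::uint32_t MaxStencil(unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    return (1u << bits) - 1u;
}

// Wrapping operations, as with D3DSTENCILOP_INCR and D3DSTENCILOP_DECR.
std::uint32_t IncrWrap(std::uint32_t v, unsigned bits) { return (v + 1u) & MaxStencil(bits); }
std::uint32_t DecrWrap(std::uint32_t v, unsigned bits) { return (v - 1u) & MaxStencil(bits); }

// Clamping operations, as with D3DSTENCILOP_INCRSAT and D3DSTENCILOP_DECRSAT.
std::uint32_t IncrSat(std::uint32_t v, unsigned bits) { return v == MaxStencil(bits) ? v : v + 1u; }
std::uint32_t DecrSat(std::uint32_t v) { return v == 0u ? 0u : v - 1u; }

// Hypothetical driver: apply a number of increments and decrements to one pixel.
std::uint32_t FinalStencil(unsigned bits, int increments, int decrements,
                           bool wrap, bool decrementFirst)
{
    std::uint32_t s = 0u;
    auto doIncrements = [&] {
        for (int i = 0; i < increments; ++i) s = wrap ? IncrWrap(s, bits) : IncrSat(s, bits);
    };
    auto doDecrements = [&] {
        for (int i = 0; i < decrements; ++i) s = wrap ? DecrWrap(s, bits) : DecrSat(s);
    };
    if (decrementFirst) { doDecrements(); doIncrements(); }
    else                { doIncrements(); doDecrements(); }
    return s;
}
```

With a 2-bit stencil buffer, five increments and four decrements leave a value of 1 when wrapping is used, regardless of which pass runs first. With clamping, the same sequence (increments first) ends at 0, so the pixel would wrongly be treated as lit.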
Let's move on to adding the shadows into the scene now! The only thing we need to do now is make use of the stencil values and shade the appropriate pixels in the scene, as described by step 6 in the How It Is Done section. This is done with the DrawShadow() function in the DepthPassCPU sample.
01  HRESULT CDepthPass::DrawShadow()
02  {
03      // Set renderstates: disable z-buffering, enable stencil, and turn on
04      // alpha blending
05      m_pd3dDevice->SetRenderState( D3DRS_ZENABLE, FALSE );
06      m_pd3dDevice->SetRenderState( D3DRS_STENCILENABLE, TRUE );
07      m_pd3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, TRUE );
08      m_pd3dDevice->SetRenderState( D3DRS_SRCBLEND, D3DBLEND_SRCALPHA );
09      m_pd3dDevice->SetRenderState( D3DRS_DESTBLEND, D3DBLEND_INVSRCALPHA );
10
11      // Only write where stencil val >= 1
12      m_pd3dDevice->SetRenderState( D3DRS_STENCILREF, 0x1 );
13      m_pd3dDevice->SetRenderState( D3DRS_STENCILFUNC, D3DCMP_LESSEQUAL );
14      m_pd3dDevice->SetRenderState( D3DRS_STENCILPASS, D3DSTENCILOP_KEEP );
15
16      // Draw a big, gray square
17      m_pd3dDevice->SetVertexShader( D3DFVF_BIGSQUAREVERTEX );
18      m_pd3dDevice->SetStreamSource( 0, m_pBigSquareVB, sizeof(BIGSQUAREVERTEX) );
19      m_pd3dDevice->DrawPrimitive( D3DPT_TRIANGLESTRIP, 0, 2 );
20
21      // Restore render states
22      m_pd3dDevice->SetRenderState( D3DRS_ZENABLE, TRUE );
23      m_pd3dDevice->SetRenderState( D3DRS_STENCILENABLE, FALSE );
24      m_pd3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, FALSE );
25
26      return S_OK;
27  }
To shade the pixels with non-zero stencil values, we first disable depth testing in line 5 and enable stencil testing in line 6. Alpha blending and its blending parameters are set up in lines 7 through 9. Next come the critical stencil operations, set up in lines 12 through 14. We use a reference stencil value of 1 and do a "less than or equal" comparison with the value in the stencil buffer for the pixel in question. This means that for the stencil test to pass, the value from the stencil buffer must be greater than or equal to the reference value of 1, which is in agreement with the depth-pass algorithm. Lines 17 through 19 draw the quad that covers the entire screen, and alpha blending kicks in to shade every pixel onscreen that passes the stencil test. Lines 22 through 24
restore the original render states. This concludes the DepthPassCPU sample. In the next section, we look at the stencil operations of the depth-fail technique, which are slightly different from those of the depth-pass technique discussed here.
29      {
30          m_pd3dDevice->SetMaterial( &m_ShadowVolBackFaceMaterial );
31          m_pd3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, TRUE );
32          m_pd3dDevice->SetRenderState( D3DRS_SRCBLEND, D3DBLEND_SRCCOLOR );
33          m_pd3dDevice->SetRenderState( D3DRS_DESTBLEND, D3DBLEND_DESTALPHA );
34      }
35      else
36          m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x00000000 );
37
38      // Draw back side of shadow volume
39      m_pd3dDevice->SetTransform( D3DTS_WORLD, &m_matObject);
40      m_pShadow->RenderShadowVolume( m_pd3dDevice );
41      m_pd3dDevice->SetTransform( D3DTS_WORLD, &m_matObject2);
42      m_pShadow2->RenderShadowVolume( m_pd3dDevice );
43
44      // Now reverse cull order so front sides of shadow volume are written.
45      m_pd3dDevice->SetRenderState( D3DRS_CULLMODE, D3DCULL_CCW );
46      // Reverse the stencil op for back face
47      m_pd3dDevice->SetRenderState( D3DRS_STENCILZFAIL, D3DSTENCILOP_DECR );
48
49      // Show shadow volume front faces?
50      if ( m_bShowShadowVolFrontFace )
51      {
52          m_pd3dDevice->SetMaterial( &m_ShadowVolFrontFaceMaterial );
53          m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x0000000F );
54          m_pd3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, TRUE );
55          m_pd3dDevice->SetRenderState( D3DRS_SRCBLEND, D3DBLEND_SRCCOLOR );
56          m_pd3dDevice->SetRenderState( D3DRS_DESTBLEND, D3DBLEND_DESTALPHA );
57      }
58      else
59          m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x00000000 );
60
61      // Draw front side of shadow volume
62      m_pd3dDevice->SetTransform( D3DTS_WORLD, &m_matObject );
63      m_pShadow->RenderShadowVolume( m_pd3dDevice );
64      m_pd3dDevice->SetTransform( D3DTS_WORLD, &m_matObject2 );
65      m_pShadow2->RenderShadowVolume( m_pd3dDevice );
66
67      // Restore render states
68      m_pd3dDevice->SetRenderState(
69      m_pd3dDevice->SetRenderState(
70      m_pd3dDevice->SetRenderState(
71      m_pd3dDevice->SetRenderState(
Lines 4 through 16 basically do the same setting up as the RenderShadowVolume() function in the DepthPassCPU sample. Line 19 sets the stencil operation to increment the stencil count if the stencil test passes while the depth test fails. Note that the stencil test has been set to always pass in line 14, and hence only the depth test matters here. Incrementing the stencil values with the failure of the depth test is in agreement with the depth-fail algorithm presented in the Depth-fail (z-fail) section. Line 22 sets the z-bias level for the rendering of the shadow volume to a level of 0 to force it to render behind the actual occluder geometries. Let's ignore this for the time being; we shall return to the z-bias issue soon in the Rendering Shadow Volume Capping section. In accordance with the depth-fail algorithm, we reverse the culling mode in line 25 to render the back faces of the shadow volume. The code from lines 28 through 36 either sets up alpha blending to expose the shadow volume or disables color writes, depending on whether the program is showing the shadow volume. We then draw the back faces with the code from lines 39 to 42. The first step of the depth-fail algorithm is complete. Next, we reverse the culling mode to draw the front faces at line 45 and set the stencil operation to decrement the stencil values with depth test failure at line 47. Lines 50 through 59 do the necessary settings, depending on whether the program is exposing the front faces of the shadow volume to the viewer. Lines 62 through 65 draw the front faces of the shadow volumes. The render states are restored with the code from lines 68 through 72. Note that the same reasoning for using wrapping, rather than clamping, stencil increment and decrement operations applies to the depth-fail RenderShadowVolume() function described above. The DrawShadow() function that shades the pixels in shadow is similar for both the DepthPassCPU and DepthFailCPU samples.
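The two counting schemes can be compared for a single pixel with a short simulation. The sketch below is an illustration under stated assumptions: the eye is outside the shadow volume, the volume is closed and not clipped, the depth test is "less than or equal", and the counter is a plain signed integer rather than a wrapping stencil value. The VolumeFace type and both function names are hypothetical.

```cpp
#include <vector>

// One shadow volume polygon covering the pixel, at a given depth.
struct VolumeFace { float depth; bool frontFacing; };

// Depth-pass: front faces increment and back faces decrement
// when the depth test passes.
int DepthPassCount(float sceneDepth, const std::vector<VolumeFace>& faces)
{
    int count = 0;
    for (const VolumeFace& f : faces) {
        const bool depthTestPasses = f.depth <= sceneDepth;
        if (depthTestPasses) count += f.frontFacing ? 1 : -1;
    }
    return count;
}

// Depth-fail: back faces increment and front faces decrement
// when the depth test fails.
int DepthFailCount(float sceneDepth, const std::vector<VolumeFace>& faces)
{
    int count = 0;
    for (const VolumeFace& f : faces) {
        const bool depthTestPasses = f.depth <= sceneDepth;
        if (!depthTestPasses) count += f.frontFacing ? -1 : 1;
    }
    return count;
}
```

For a volume whose front face lies at depth 0.3 and back face at depth 0.7, a scene pixel at depth 0.5 is inside the volume and both functions return 1. Pixels at depths 0.2 and 0.9 lie outside the volume, and both functions return 0.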
The stencil operations described in this section are one-sided in nature and hence require two passes to render the shadow volume. Newer graphics cards that support DirectX 9 provide new two-sided stencil operations that allow the rendering of shadow volumes in a single pass. All the appropriate front- and back-face stencil operations fill up the stencil buffers in a single rendering pass. For more details on the two-sided stencil mode, please refer to the section titled DirectX 9 HLSL Samples. We now continue with the DepthFailCPU sample by tackling the last tricky issue of rendering the shadow volume capping for the depth-fail technique (remember the z-biasing at line 22?).
from graphics APIs. Eric Lengyel [26] described how a separate projection matrix could be computed to render polygons at different depth values without altering their projected screen coordinates or texture mapping perspective. Tweaking the projected depth values on a per-object basis can provide fine control and sometimes better performance, but the implementation is also comparatively more involved. Choosing the appropriate camera space offset can also be messy due to the non-uniform nature of depth buffer precision for perspective viewing [11]. Depth precision can become horrendously poor with increasing distance from the camera and cause polygons that are close together, in terms of depth values, to be rendered incorrectly. For example, a piece of tapestry on the wall may get rendered behind the wall in several places due to poor depth precision as the viewer moves farther away. Depth precision errors are usually accompanied by the flickering of polygons, which is a problem commonly known as z-fighting. The camera space offset used for tweaking projected depth values needs to be adjusted accordingly to account for this non-linear behavior. Alternatively, we can simply make use of Direct3D's depth bias capability to render the front capping properly without worrying about anything else. In Direct3D, depth values of fragments generated while rasterizing a primitive can be biased to help mitigate z-fighting issues when drawing coplanar polygons. The D3DRS_ZBIAS flag in Direct3D's D3DRENDERSTATETYPE can be used to bias the occluder's front-facing geometries so that they are more likely to be rendered in front of its shadow volume front capping.
    // Set higher z-bias for occluders
    m_pd3dDevice->SetRenderState(D3DRS_ZBIAS, 1);

    // Render occluder here
    . . .

    // Set lower z-bias for shadow volumes
    m_pd3dDevice->SetRenderState(D3DRS_ZBIAS, 0);

    // Render shadow volume here
Simply setting the D3DRS_ZBIAS values before rendering the two groups of coplanar geometries, as shown in the code above, would achieve the desired effect. We set the z-bias to a higher value for the occluder's geometries and a lower value for its shadow volume. This ensures that the front capping of the shadow volume is rendered behind the occluder's front-facing geometries. This completes the entire depth-fail algorithm, and the stencil buffer would now be filled with the correct stencil values that are needed for comparison in order to shade the pixels in shadow. The pixel shading is done by the DrawShadow() function, which is similar to the one used in the DepthPassCPU sample. With that, we conclude the DepthFailCPU sample. As a side note, DirectX 9 [12] is able to distinguish between legacy devices that expose D3DRS_ZBIAS and those that can perform true slope-scale-based depth bias. Two new floating-point values, D3DRS_DEPTHBIAS and D3DRS_SLOPESCALEDEPTHBIAS, are used to compute the offset. The offset is added to the fragment's interpolated depth value to produce the final depth value that is used for depth testing. The new caps for these two values are D3DPRASTERCAPS_DEPTHBIAS and D3DPRASTERCAPS_SLOPESCALEDEPTHBIAS. D3DRS_DEPTHBIAS is used in the DirectX 9 HLSL Samples section. This ends our discussion of the implementation of both the depth-pass (DepthPassCPU sample) and depth-fail (DepthFailCPU sample) algorithms on the CPU. Next up, we dive straight into the methodology and implementation of the depth-fail algorithm using the programmable graphics pipeline!
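As a rough sketch of how the two DirectX 9 values combine, the function below follows the formula given in the DirectX 9 documentation, in which the offset is the maximum depth slope of the triangle multiplied by the slope-scale value, plus the constant bias. The function name and its parameters are hypothetical; in a real application the hardware computes the depth slope itself.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: offset = m * slopeScaleDepthBias + depthBias, where m is
// the larger of the absolute depth slopes of the triangle in x and y.
float DepthBiasOffset(float dzdx, float dzdy,
                      float slopeScaleDepthBias, float depthBias)
{
    const float maxSlope = std::max(std::fabs(dzdx), std::fabs(dzdy));
    return maxSlope * slopeScaleDepthBias + depthBias;
}
```

A polygon that faces the viewer directly has a depth slope of zero and receives only the constant bias, while a polygon seen at a grazing angle receives a proportionally larger offset.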
great potential and flexibility for graphics programmers to achieve effects at a level of realism never before dreamed possible. Different lighting methods, texturing, and geometry manipulation are now possible with the use of vertex and pixel shaders. Rendering engines are no longer bound by the limitations imposed by fixed-function pipelines. With this explosion of graphics shaders, we need to look at the effects that we have achieved with the fixed-function pipeline in the past and see if it is possible to do them more efficiently and faster with the programmable graphics pipeline. That is exactly what we are going to do by implementing the depth-fail stencil shadow volume algorithm using vertex shaders. The reader should note that implementing shadow volumes in shaders may or may not improve shadow volume performance. We discuss the pros and cons of using shaders for shadow volumes in the Better with Shaders? section after we have gone through the implementation.
How It Is Done
For stencil shadow volume implementation using shaders, the general steps presented in the previous How It Is Done section for implementation on the CPU still apply. The main difference lies in the execution of the silhouette calculation. When we talk about implementing stencil shadow volumes using shaders, we are actually referring mainly to the offloading of the silhouette computation from the CPU to the GPU. This means that we do not compute the silhouette of the occluder in our program; instead, this is done by a vertex program running on the GPU that is fed with the appropriate preprocessed occluder geometry and vertex shader constants. Let's list the general steps for implementing shadow volumes using vertex shaders:
1. Preprocessing of occluder geometry. Insert degenerate quads into edges shared by exactly two triangles.
2. Render the scene to fill the depth buffer with the correct z-values.
3. Select a light source. Clear the stencil buffer if this is the first light.
4. Set up the stencil operations, update vertex shader constants, and render the shadow volumes using the vertex shader.
5. Repeat steps 3 to 4 for all the selected lights in the scene.
6. Using the stencil buffer, do a lighting pass (or make it a tone darker) to shade the pixels that correspond to non-zero stencil values.
As far as the scene rendering pass (step 2), stencil operations (step 4), and lighting pass (step 6) are concerned, there is little difference from the CPU implementations. The main difference lies in the preprocessing of the occluder geometry in step 1 and the setting of the vertex shader constants and rendering in step 4. In a nutshell, we preprocess the occluder's mesh in such a way that when it is fed into the graphics pipeline, the vertex shader deforms it into the shadow volume that we desire. In the following sections, we go through the steps and peruse the code that comes with the samples FiniteGPU and InfiniteGPU. As the names imply, the FiniteGPU sample demonstrates finite shadow volume extrusion using vertex shaders, while the InfiniteGPU sample implements infinite shadow volume extrusion through the homogeneous coordinates discussed in the Forming the Shadow Volume section. Both samples are based on DirectX 8.1. The section titled DirectX 9 HLSL Samples discusses two similar samples that are based on DirectX 9. Note that both the FiniteGPU and InfiniteGPU samples implement the depth-fail stencil shadow volume algorithm for good reasons, which we find out about soon.
Preprocessing of Data
The very first step to implementing shadow volumes in shaders is to preprocess the original geometry into a form usable by the vertex shader. Remember that during the creation of the shadow volume, we need to create new geometry data such as the
extruded vertices and the faces that form the sides and capping of the volume. With vertex shaders, this is not possible, as the current generation of programmable graphics pipeline does not allow for the creation of new vertex data. It is strictly a one vertex in and one vertex out pipeline. This limitation is probably not going to go away in the foreseeable future. Hence, we need to overcome it by preprocessing the source geometry data in such a way that makes it possible to form a shadow volume in any direction without creating new geometries. Note that both FiniteGPU and InfiniteGPU use the same preprocessing function.
Figure 17 depicts the preprocessing of the source geometry that forms a cube; it only shows the front faces for simplicity. The shared edges of the faces are filtered out, and a degenerate quad is inserted to replace each shared edge. Degenerate quads are formed by triangles with zero area. The two edges that form the opposing sides of each degenerate quad have the same positional values (same x, y, and z coordinates) but different face normals. In the FiniteGPU and InfiniteGPU samples, preprocessing is done by the member function Create() of the CShadow class. Let's briefly run through the preprocessing algorithm:
1. Step through all the faces in the source mesh.
2. Compute the face normal for each face.
3. Step through the three edges of each face.
   a. Insert the edge into a list for checking.
   b. If the edge already exists in the list (shared edge found):
      i. If the normals of the faces sharing the edge are not parallel, insert a degenerate quad into the output list.
      ii. Else, only insert the shared edge into the output list.
   c. Remove the current edge and any shared edge from the checklist.
4. Create index and vertex buffers with only position and normal information from the output list.
5. If there are any edges left in the checklist, the source mesh is not a closed volume, since all edges should be shared in a closed volume mesh.
Note that the above algorithm also requires the source mesh to be a closed volume, which is the same requirement imposed in the CPU determination of silhouette edges presented earlier. The code in the Create() function follows the above steps in preprocessing the source mesh data. The reader should study the code to get a better understanding of the preprocessing algorithm. The algorithm implemented emphasizes clarity over efficiency. The general implementation in Create() does not handle welded meshes and is similar to the preprocessing algorithm used in the ATI demos [18, 19]. Many other more efficient algorithms do exist. A major problem with preprocessing geometries for shader implementations of shadow volumes is the large number of vertices that it generates. Typical final preprocessed meshes contain around three times as many vertices as the source meshes. This matters because we are stretching the vertex throughput of the GPU during the rendering of the shadow volume. We discuss this problem in more detail in the Better with Shaders? section. For now, let's implement a simple optimization to try to cut down the number of vertices generated during preprocessing. Notice that we do not indiscriminately insert degenerate quads into every shared edge in the preprocessing algorithm. Doing so would be very inefficient, and the final preprocessed
polygon count would explode. A simple optimization would be to test whether a shared edge has a good chance of becoming a silhouette edge. If a shared edge has almost no chance of becoming a silhouette edge, then there is really no need to insert a degenerate quad to replace that edge. A simple way to determine the chance of an edge forming part of a silhouette is to test the parallelism of the normals of the faces that share it. If the two faces have normals that are almost parallel in the same direction, the shared edge lies in a flat surface and would have little chance of becoming part of a silhouette. In fact, if the face normals are exactly parallel, it is not possible for the shared edge to be part of any silhouette. Thus, a simple dot product of the two unit face normals will suffice for such a test. If the dot product result is 1.0, the edge is left untouched, as it cannot possibly become a silhouette edge. In an actual implementation, we can further cut down the number of vertices generated by testing the dot product result against values such as 0.9 or 0.8, which would then include surfaces that are quite flat. Figure 18 shows that this simple optimization halves the number of degenerate triangles needed for the front faces of our simple cube from 12 (Figure 17) to 6.
Figure 18: Shared edges on flat surfaces need not be replaced by degenerate quads.
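The flat-edge test described above can be sketched in a few lines. This is a minimal illustration rather than the code from the samples: the names Vec3, needsDegenerateQuad, and flatThreshold are hypothetical, and the default threshold of 0.9 is simply one of the example values mentioned in the text.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static Vec3 normalize(const Vec3& v) {
    float len = std::sqrt(dot(v, v));
    assert(len > 0.0f);  // a face normal must not be the zero vector
    return { v.x / len, v.y / len, v.z / len };
}

// Returns true when the edge shared by two faces should be replaced by a
// degenerate quad. flatThreshold is a hypothetical tuning parameter: edges
// whose face normals have a dot product at or above it are treated as lying
// on a flat surface and are left untouched.
bool needsDegenerateQuad(const Vec3& faceNormal1, const Vec3& faceNormal2,
                         float flatThreshold = 0.9f) {
    return dot(normalize(faceNormal1), normalize(faceNormal2)) < flatThreshold;
}
```

Lowering the threshold skips more edges and saves more vertices, at the risk of missing a silhouette edge on a gently curved surface, so the value is best tuned per model.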
An obvious point to note here is that such preprocessing should focus on minimizing the final geometry count instead of processing speed. In fact, the preprocessing should be done entirely offline. In the next section, we look at how these degenerate quads help form shadow volumes on the hardware without the need to create new vertices during silhouette computation.
On the left side of Figure 19, we can see two faces with a common shared edge that has been replaced by a degenerate quad. Each of the two opposing edges of the degenerate quad carries the face normal of the face to which it belongs. Next, assume that the direction of a light source is as shown on the right. Face 1 is back facing the light source, while face 2 is front facing the light source. Hence, the shared edge becomes part of the silhouette, as seen from the position of the light source. Vertices that are facing away from the light source are then extruded in the direction of the light's ray, as shown on the right side of Figure 19. This means that the opposing edges of the degenerate quad are stretched apart to form a normal quad with a non-zero area. This is exactly how the sides of the shadow volume are formed! Also note that the extruded face 1 now becomes the back capping, while the untouched face 2 automatically acts as the front capping. Hence, for shader implementation in the FiniteGPU and
InfiniteGPU samples, it only makes sense to implement the stencil operations according to the depth-fail stencil algorithm, as the shadow volume capping already exists! From this point onward, we go into the implementations of the FiniteGPU and InfiniteGPU samples. The two samples are differentiated by the vertex shader constants setup and the vertex shader code they execute. This means that of the six steps presented in the last "How It Is Done" section, only step 4 is different between the two samples.
We declare the vertex shader with the vertex position and face normal lined up as input registers 0 and 1. The Visual Studio .NET project files for the FiniteGPU sample have been set to compile the vertex shader code using nVidia's NVASM [17] into the .vso binary, which is fed into the CreateVSFromCompiledFile() function taken from Wolfgang F. Engel [16]. Next, we shall take a look at the RenderShadowVolume() function before going into the vertex shader constants setting.
01 HRESULT CDepthFail::RenderShadowVolume()
02 {
03     // Disable z-buffer writes, z-testing still occurs, enable stencil buffer
04     m_pd3dDevice->SetRenderState( D3DRS_ZWRITEENABLE, FALSE );
05     m_pd3dDevice->SetRenderState( D3DRS_STENCILENABLE, TRUE );
06
07     // Don't bother with interpolating color
08     m_pd3dDevice->SetRenderState( D3DRS_SHADEMODE, D3DSHADE_FLAT );
09
10     // (StencilRef & StencilMask) CompFunc (StencilBufferValue & StencilMask)
11     m_pd3dDevice->SetRenderState( D3DRS_STENCILREF, 0x1 );
12     m_pd3dDevice->SetRenderState( D3DRS_STENCILMASK, 0xffffffff );
13     m_pd3dDevice->SetRenderState( D3DRS_STENCILWRITEMASK, 0xffffffff );
14     m_pd3dDevice->SetRenderState( D3DRS_STENCILFUNC, D3DCMP_ALWAYS );
15     m_pd3dDevice->SetRenderState( D3DRS_STENCILPASS, D3DSTENCILOP_KEEP );
16     m_pd3dDevice->SetRenderState( D3DRS_STENCILFAIL, D3DSTENCILOP_KEEP );
17
18     // Back face depth test fail -> Incr
19     m_pd3dDevice->SetRenderState( D3DRS_STENCILZFAIL, D3DSTENCILOP_INCR );
20
21     // Set lower z-bias for shadow volumes
22     m_pd3dDevice->SetRenderState( D3DRS_ZBIAS, 0 );
23
24     // Now reverse cull order so back sides of shadow volume are written.
25     m_pd3dDevice->SetRenderState( D3DRS_CULLMODE, D3DCULL_CW );
26
27     // Show shadow volume back faces?
28     if ( m_bShowShadowVolBackFace )
29     {
30         m_pd3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, TRUE );
31         m_pd3dDevice->SetRenderState( D3DRS_SRCBLEND, D3DBLEND_SRCCOLOR );
32         m_pd3dDevice->SetRenderState( D3DRS_DESTBLEND, D3DBLEND_DESTALPHA );
33     }
34     else
35         m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x00000000 );
36
37     // Set up shader constants and render the shadow for object 1
38     m_pShadow->SetShaderConstants( &m_pLight, &m_matObject, &m_matView,
39                                    &m_matProject );
40     m_pShadow->RenderShadow();
41
42     // Set up shader constants and render the shadow for object 2
43     m_pShadow2->SetShaderConstants( &m_pLight, &m_matObject2, &m_matView,
44                                     &m_matProject );
45     m_pShadow2->RenderShadow();
46
47     // Now reverse cull order so front sides of shadow volume are written.
48     m_pd3dDevice->SetRenderState( D3DRS_CULLMODE, D3DCULL_CCW );
49
50     // Reverse the stencil op for front face
51     m_pd3dDevice->SetRenderState( D3DRS_STENCILZFAIL, D3DSTENCILOP_DECR );
52
53     // Show shadow volume front faces?
54     if ( m_bShowShadowVolFrontFace )
55     {
56         m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x0000000F );
57         m_pd3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, TRUE );
58         m_pd3dDevice->SetRenderState( D3DRS_SRCBLEND, D3DBLEND_SRCCOLOR );
59         m_pd3dDevice->SetRenderState( D3DRS_DESTBLEND, D3DBLEND_DESTALPHA );
60     }
61     else
62         m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x00000000 );
63
64     // Set up shader constants and render the shadow for object 1
65     m_pShadow->SetShaderConstants( &m_pLight, &m_matObject, &m_matView,
66                                    &m_matProject );
67     m_pShadow->RenderShadow();
68
69     // Set up shader constants and render the shadow for object 2
70     m_pShadow2->SetShaderConstants( &m_pLight, &m_matObject2, &m_matView,
71                                     &m_matProject );
72     m_pShadow2->RenderShadow();
73
74     m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, 0x0000000F );
75     m_pd3dDevice->SetRenderState( D3DRS_SHADEMODE, D3DSHADE_GOURAUD );
76     m_pd3dDevice->SetRenderState( D3DRS_ZWRITEENABLE, TRUE );
77     m_pd3dDevice->SetRenderState( D3DRS_STENCILENABLE, FALSE );
78     m_pd3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, FALSE );
79
80     return S_OK;
81 }
The setup code from lines 4 to 35 is similar to that in the DepthFailCPU sample discussed earlier. The difference comes in lines 38 through 45, where we are required to set the shader constants prior to rendering the preprocessed shadow volume geometry using our vertex program. The same goes for the second shadow volume rendering pass in lines 65 through 72. Note that two-sided stenciling, as presented in the "DirectX 9 HLSL Samples" section, would work here as well to render the shadow volume in a single pass. In fact, the DirectX 9 samples in that section implement both stencil modes for comparison. Takashi Imagire [14] presented a depth-pass shadow volume implementation that utilizes two-sided stenciling and vertex shaders. Let's take a look at the code of the SetShaderConstants() function before going into the vertex shader code.
01 void CShadow::SetShaderConstants( const D3DLIGHT8* pLight,
02                                   const D3DXMATRIX* matWorld,
03                                   const D3DXMATRIX* matView,
04                                   const D3DXMATRIX* matProj )
05 {
06     D3DXMATRIX matClip, matInvWorld;
07     D3DXMatrixMultiply( &matClip, matWorld, matView );
08     D3DXMatrixMultiply( &matClip, &matClip, matProj );
09     D3DXMatrixInverse( &matInvWorld, NULL, matWorld );
10
11     D3DXVECTOR4 vConst( 0.0f, 0.0f, 0.0f, m_fExtrusionLen );
12
13     // Yellowish-green hue for drawing shadow volume if needed
14     D3DXVECTOR4 vColor( 0.3f, 0.4f, 0.2f, 0.0f );
15
16     // Light pos in world space
17     D3DXVECTOR4 objectLightPos = D3DXVECTOR4( pLight->Position.x,
18         pLight->Position.y, pLight->Position.z, 1.0f );
19
20     // Transform light pos to object space
21     D3DXVec4Transform( &objectLightPos, &objectLightPos, &matInvWorld );
22
23
24     ... 0, ...
25     ... 1, ...
26     ... 2, ...
27     ... 6, ...
28 }
The transformation matrix for clip space is the first thing to be computed in lines 7 and 8. At line 11, we set up a vector with the w component as the member variable m_fExtrusionLen that defines the absolute extrusion distance. We define a vector at line 14 to hold an RGBA color value in case the program needs to expose the rendering of the shadow volume to the viewer. The light source position is transformed from world space to object space at line 21. The reason for doing this is to allow us to compute the light ray vector in object space without the need to transform the face normal. It is obviously far more efficient to incur a one-time transformation cost for the light position, as opposed to transforming every single face normal. The vectors are lined up in the constant registers, as shown in lines 24 through 27. It is time to dive into the vertex shader code.
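To see why moving the light into object space is enough, the following sketch compares the facing term computed both ways. It is an illustration under stated assumptions, not code from the sample: the world transform is assumed to be rigid (here a hypothetical rotation about the Y axis followed by a translation), and the names RigidTransform, facingTermObjectSpace, and facingTermWorldSpace are invented for the example. With non-uniform scaling, the two results would no longer match exactly.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 add(const Vec3& a, const Vec3& b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static Vec3 sub(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Hypothetical rigid world transform: rotate about Y, then translate.
struct RigidTransform { float angleY; Vec3 translation; };

static Vec3 rotateY(const Vec3& v, float angle) {
    float c = std::cos(angle), s = std::sin(angle);
    return { c * v.x + s * v.z, v.y, -s * v.x + c * v.z };
}

static Vec3 objectToWorld(const RigidTransform& m, const Vec3& p) {
    return add(rotateY(p, m.angleY), m.translation);
}

static Vec3 worldToObject(const RigidTransform& m, const Vec3& p) {
    return rotateY(sub(p, m.translation), -m.angleY);
}

// Cheap path: transform the light once, leave every normal untouched.
float facingTermObjectSpace(const RigidTransform& m, const Vec3& position,
                            const Vec3& normal, const Vec3& lightWorld) {
    Vec3 lightObject = worldToObject(m, lightWorld);
    return dot(normal, sub(position, lightObject));
}

// Reference path: transform the vertex and its normal to world space.
float facingTermWorldSpace(const RigidTransform& m, const Vec3& position,
                           const Vec3& normal, const Vec3& lightWorld) {
    Vec3 positionWorld = objectToWorld(m, position);
    Vec3 normalWorld = rotateY(normal, m.angleY);
    return dot(normalWorld, sub(positionWorld, lightWorld));
}
```

Both functions return the same value up to rounding, but the object-space version performs one transform per object instead of one per vertex.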
01 // c0    : 0, 0, 0, m_fExtrusionLen
02 // c1    : Light pos in object space
03 // c2-c5 : World*View*Proj matrix
04 // c6    : Color for exposing the shadow volume
05
06 vs.1.1
07 // Output diffuse color to expose shadow volume to viewer if needed
08 mov oD0, c6
09
10 // Ray from light to pt in object space
11 sub r1, v0, c1
12
13 // Normalize ray
14 dp3 r1.w, r1, r1
15 rsq r1.w, r1.w
16 mul r1, r1, r1.w
17
18 // Dot ray and normal
19 dp3 r10.w, v1, r1
20
21 // Normal faces away from light if dot result < 0.0
22 slt r10.x, r10.w, c0.x
23
24 // Extrude along ray
25 mul r10, r10.x, c0.w
26 mad r0, r1, r10.x, v0
27
28 // Transform to clip space and output pt
29 mul r4, r0.x, c[2]
30 mad r4, r0.y, c[3], r4
31 mad r4, r0.z, c[4], r4
32 add oPos, c[5], r4
We immediately output the diffuse color at line 8 using constant register c6, which was set with the RGBA color values. This can be skipped entirely if we do not want to expose the shadow volume to the viewer. Next, we compute the vector of the incident light ray at line 11 and normalize the result. The dot product of the light ray and the face normal is done at line 19, and the result is stored in the w component of r10. At line 22, we compare the result of the dot product with 0.0 and form a masking value using the result of this comparison. If the dot product result is less than 0.0, this means that the angle between the vectors is larger than 90 degrees (or you can also say smaller than -90 degrees), and the vertex has a face normal pointing away from the light source. For this case, the masking value is stored as 1.0 in the x component of r10. For the other case, whereby the dot product result is not less than 0.0, the masking value is set as 0.0. The extrusion (or rather, the displacement) of the vertex is done in lines 25 and 26. We multiply the masking value with the extrusion distance to compute the final extrusion distance. Since the masking value can only be 0.0 or 1.0, the result of the multiplication can be either a zero or non-zero extrusion distance. Line 26 multiplies the normalized light ray vector with the extrusion distance and adds it to the vertex's position, effectively extruding the vertex in the direction of the light ray. If the masking value is 0.0, then the extrusion distance will be 0.0 and the vertex stays
unchanged. Lines 29 to 32 simply transform the final vertex to clip space and send the result to the vertex position output register. This concludes the entire implementation of the FiniteGPU sample. Next up, we look at the InfiniteGPU sample that makes use of the homogeneous coordinate system, discussed previously in the "Forming the Shadow Volume" section, to extrude shadow volumes to infinity.
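When debugging the FiniteGPU shader, it can help to run the same arithmetic on the CPU for a single vertex. The sketch below transcribes lines 11 through 26 of the listing one instruction at a time; it is not part of the sample, and the names Vec3 and extrudeFinite are hypothetical. The masking rule is copied from line 22 exactly as written (mask is 1.0 when the dot product is below 0.0), and the clip space transform of lines 29 to 32 is left out.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// CPU transcription of shader lines 11-26. All inputs are in object space.
Vec3 extrudeFinite(const Vec3& position, const Vec3& faceNormal,
                   const Vec3& lightPos, float extrusionLen) {
    Vec3 ray = sub(position, lightPos);                    // line 11: sub r1, v0, c1
    float lengthSquared = dot(ray, ray);                   // line 14: dp3 r1.w, r1, r1
    assert(lengthSquared > 0.0f);                          // vertex must not sit on the light
    float invLength = 1.0f / std::sqrt(lengthSquared);     // line 15: rsq r1.w, r1.w
    ray = { ray.x * invLength, ray.y * invLength, ray.z * invLength };  // line 16: mul
    float rayDotNormal = dot(faceNormal, ray);             // line 19: dp3 r10.w, v1, r1
    float mask = (rayDotNormal < 0.0f) ? 1.0f : 0.0f;      // line 22: slt r10.x, r10.w, c0.x
    float distance = mask * extrusionLen;                  // line 25: mul r10, r10.x, c0.w
    return { ray.x * distance + position.x,                // line 26: mad r0, r1, r10.x, v0
             ray.y * distance + position.y,
             ray.z * distance + position.z };
}
```

Feeding it the two opposing edges of one degenerate quad, which share a position but carry different face normals, shows one edge moving by extrusionLen along the ray while the other stays in place.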
40 41 42 43 44 45 }
Since we are considering an omnidirectional point light, the light transformation matrix can be created solely by translation at line 13. We create the LightClip transformation matrix (light * view * projection) at lines 15 to 17. The LightClip transformation goes from WorldLight space to clip space, similar to the normal clip space transformation matrix where we go from world space to clip space. The transformation matrix to WorldLight space is computed at lines 21 to 23. At line 37, we transform the light position from its original world space to the occluder's object space to compute the light ray vector in object space within the shader. This avoids the need to transform the face normals to world space for every single vertex and also results in shorter vertex shader code. Finally, lines 40 through 44 define how the values will be lined up in the constant registers. Next up, let's jump right into the vertex shader code:
01 // c0     : Light position in object space
02 // c1     : 1, 1, 1, 0
03 // c2-c5  : Light * View * Proj = LightClip?
04 // c6-c9  : WorldInvLight matrix
05 // c10    : Color for exposing the shadow volume
06
07 vs.1.1
08 // Output diffuse color to expose shadow volume to viewer if needed
09 mov oD0, c10
10
11 // Light to vertex ray in object space
12 sub r1, v0, c0
13
14 // Transform vertex from object space to WorldLight space
15 // where the light is centered on origin
16 mul r4, v0.x, c[6]
17 mad r4, v0.y, c[7], r4
18 mad r4, v0.z, c[8], r4
19 add r9, c[9], r4
20
21 // Normalize ray computed previously
22 dp3 r1.w, r1, r1
23 rsq r1.w, r1.w
24 mul r1, r1, r1.w
25
26 mov r10, c1
27
28 // Dot ray and normal
29 dp3 r10.w, v1, r1
30
31 // If dot result < 0.0 = face away from light
32 // Form mask 1,1,1,1 for light facing, OR mask 1,1,1,0 for non light facing
33 slt r10, c1.w, r10
34
35 // Set w value to 0.0 for infinite extrusion or 1.0 for no extrusion
36 mul r9, r9, r10
37
38 // Transform final vertex to LightClip space
39 mul r4, r9.x, c[2]
40 mad r4, r9.y, c[3], r4
41 mad r4, r9.z, c[4], r4
42 mad oPos, r9.w, c[5], r4
At line 12, we form the vector of the light ray in object space (remember, we do not need to transform the face normals if we do this in object space). Next, we proceed to transform the vertex to WorldLight space (World*InvLight) at lines 16 to 19. Next, we get back to object space and normalize our light ray vector at lines 22 through 24. I hate to use mov but was forced to do so at line 26 because we need the masking values (1,1,1,0) for the slt command later on. At line 29, we use the light ray vector to perform a dot product with the face normal. Line 33 is the heart of this shader. It compares the dot product result with the mask (1,1,1,0) that we loaded earlier on and creates either a mask (1,1,1,1) for light-facing vertices or (1,1,1,0) for vertices that face away from the light. We are going to make good use of this mask at line 36 to decide whether a vertex stays put or packs up for the trip to infinity. Obviously, those that face the light will be left unscathed, but those that face away from the light will
have their w component zeroed, and the homogeneous representation of a point becomes a representation of a vector. Finally, we perform the all-important transformation to clip space and pass the result to the output register.
Note: You can try moving toward the extruded geometries for this sample (use wireframe mode; it is easier to see), but you will find that it never gets any closer! The vertices at infinity are fixed at a particular point on screen. Try the same thing with the other samples, and you will fly past the extruded volume in no time.
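The behavior described in the note follows directly from setting w to zero, and a small CPU sketch makes it concrete. The code below is an illustration only: applyExtrusionMask mirrors lines 33 and 36 of the listing, while projectToScreen is a hypothetical pinhole camera (unit focal length, looking down +z, positioned by translation only) standing in for the real view and projection matrices.

```cpp
#include <cassert>
#include <cmath>

struct Vec4 { float x, y, z, w; };
struct Vec2 { float x, y; };

// Mirrors shader lines 33 and 36: slt builds the mask (1, 1, 1, 0 < dot ? 1 : 0)
// and mul applies it, so only the w component of the position can change.
Vec4 applyExtrusionMask(const Vec4& lightSpacePos, float rayDotNormal) {
    float wMask = (0.0f < rayDotNormal) ? 1.0f : 0.0f;
    return { lightSpacePos.x, lightSpacePos.y, lightSpacePos.z, lightSpacePos.w * wMask };
}

// Hypothetical camera: translated to (camX, camY, camZ), no rotation.
// The translation is scaled by w, so it has no effect on a point with w == 0.
Vec2 projectToScreen(const Vec4& p, float camX, float camY, float camZ) {
    float viewX = p.x - p.w * camX;
    float viewY = p.y - p.w * camY;
    float viewZ = p.z - p.w * camZ;
    assert(viewZ > 0.0f);  // the point must be in front of the camera
    return { viewX / viewZ, viewY / viewZ };
}
```

A vertex that keeps w = 1 slides across the screen as the camera moves, whereas a vertex with w = 0 projects to the same screen position from every camera position, which is why the extruded geometry never appears to get closer.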
With this, we conclude the InfiniteGPU sample. You are now armed with a good working knowledge of not just one stencil shadow volume method but four of the same things done in a rather different fashion! You are probably confused and wondering which one of these suits your needs. Read on about some efficiency issues, possible optimizations, and high-level design problems that can help you make an informed choice.
the source data sets for shadow volume creation are too large, resulting in an even greater amount of wasted vertices. We can tackle the problem by optimizing the degenerate quad insertion algorithm and also reducing the source data sets needed for shadow volume creation. Previously, in the "Preprocessing of Data" section, we discussed a simple optimization to avoid inserting degenerate quads into edges shared by faces forming a flat plane. With good optimizations, the pre- and post-polygon count ratio can usually be reduced to around 2.0 without severe visual artifacts. Another possible optimization is to reduce the source data sets used for inserting degenerate quads. This encompasses the use of simplified models with a lower polygon count or the removal of useless polygons. The gem presented by Alex Vlachos and Drew Card [20] described two such optimizations in the form of vertex removal and edge collapsing. In most general cases where low to medium (MD2 or MD3) polygon count models are used, implementations on both CPU and GPU are comparable. Another key concern is the magnification of this inefficiency when a scene contains a large number of shadow-casting light sources. The iteration through the light sources to generate shadow volumes would inevitably strain the graphics hardware with more wasted vertices. But with the use of the shadow volume methodology for casting shadows, we have to be very careful with the selection of light sources within a scene even when it is done on the CPU. We discuss light source selection further in the next section. Finally, a small incentive for using shaders to implement the shadow volume generation is that the memory requirement is constant, as opposed to the dynamic shadow volume size in CPU implementations. This is because the silhouettes of occluders can sometimes vary drastically at different angles of view.
This directly affects the total geometry count and can cause further problems if the initial allocated memory is too small and reallocation is needed. GPU implementation does not suffer from this problem, as the preprocessed shadow volume geometry is loaded up as a static vertex buffer that contains all the vertices that will be needed for shadow volume generation from any angle.
Overall, in a normal game setting, where the CPU is required for artificial intelligence, physics, network (encryption/decryption), input, scripting, and a whole host of other computations, GPU implementation of shadow volume usually edges slightly ahead in terms of performance. However, readers are encouraged to evaluate and profile both approaches vigorously within their setup in order to find the best way of implementing shadow volumes. Greg James [15] showcased the use of degenerate quads for a vertex shader-based shadow volume implementation. Similarly, Chris Brennan's article [18] regarding shadow volumes used in the ATI island demo [19] uses the same approach with vertex shaders.
samples are comparatively simpler than the previous samples that had vertex shaders coded in assembly. Let's take a look at how to make use of the new two-sided stencil mode introduced in the HLSL samples. With DirectX 9, the Direct3D API now includes support for two-sided stencil operations. For both the depth-pass and depth-fail stenciling operations described earlier, we need to draw the shadow volume in two passes, once for the front faces and once for the back faces. This is due to the need to change the stenciling operations before the start of each pass, since a different set of stenciling operations is needed for drawing the front faces and back faces of the shadow volume. The need for two passes to render the shadow volume geometries places extra strain on the vertex throughput of the GPU. With two-sided stenciling in DirectX 9, we can specify different sets of stenciling operations for front faces and back faces before proceeding to render the shadow volume geometries in a single stenciling pass. Two-sided stenciling mode ensures that the stencil buffer values are filled accordingly, as if we were rendering the front and back faces separately with different stenciling operations. Whenever two-sided stenciling mode is supported, we should make use of it, and for good reason. First, we just need to send the shadow volume geometries to the graphics pipeline once instead of twice. With that comes the savings on transforming primitives, memory bandwidth between transfers, and driver overhead for sending the geometries to hardware. The graphics hardware would probably also avoid inefficiencies that arise when rendering multiple culled polygons, which causes the rasterizer to go idle, since there is nothing to draw. For two-sided stenciling mode, we need to render with no culling at all, and hardware rasterizers can minimize the idling time.
Note that this may not be true for all hardware vendors since graphics hardware and driver designs vary wildly from vendor to vendor. We should also note that the number of pixels rasterized is exactly the same as doing two passes to render the shadow volume. This means that fillrate would be the same for both stenciling modes. Considering the potential savings in other areas besides fillrate, two-sided
stenciling mode is a highly attractive new hardware feature to adopt in any stencil shadow volume implementation. A new render state, D3DRS_TWOSIDEDSTENCILMODE, can be set to TRUE to activate two-sided stenciling. It is disabled by default. When two-sided stenciling is enabled, the following render states apply only to front-facing triangles:
Render State            Operation
D3DRS_STENCILFAIL       D3DSTENCILOP to do if stencil test fails.
D3DRS_STENCILZFAIL      D3DSTENCILOP to do if stencil test passes and z-test fails.
D3DRS_STENCILPASS       D3DSTENCILOP to do if both stencil and z-tests pass.
D3DRS_STENCILFUNC       D3DCMPFUNC function. Stencil test passes if ((ref & mask) stencilfn (stencil & mask)) is true.
The following new render states apply only to back-facing triangles:
Render State              Operation
D3DRS_CCW_STENCILFAIL     D3DSTENCILOP to do if stencil test fails.
D3DRS_CCW_STENCILZFAIL    D3DSTENCILOP to do if stencil test passes and z-test fails.
D3DRS_CCW_STENCILPASS     D3DSTENCILOP to do if both stencil and z-tests pass.
D3DRS_CCW_STENCILFUNC     D3DCMPFUNC function. Stencil test passes if ((ref & mask) stencilfn (stencil & mask)) is true.
The remaining stencil render states not listed in the two tables above always apply to both front- and back-facing triangles. As with normal stenciling operations, the two-sided stenciling render states are ignored for point sprites and lines. Let's look at the actual code needed to set up two-sided depth-fail stenciling operations in DirectX 9.
01 // Disable z write, color write, use flat shade, and set to cull none
02 m_pd3dDevice->SetRenderState( D3DRS_ZWRITEENABLE, FALSE );
03 m_pd3dDevice->SetRenderState( D3DRS_COLORWRITEENABLE, FALSE );
04 m_pd3dDevice->SetRenderState( D3DRS_SHADEMODE, D3DSHADE_FLAT );
05 m_pd3dDevice->SetRenderState( D3DRS_CULLMODE, D3DCULL_NONE );
06
07 // Enable stencil operations and two-sided stencil mode
08 m_pd3dDevice->SetRenderState( D3DRS_STENCILENABLE, TRUE );
09 m_pd3dDevice->SetRenderState( D3DRS_TWOSIDEDSTENCILMODE, TRUE );
10
11 // Set front-facing stencil function to always pass
12 m_pd3dDevice->SetRenderState( D3DRS_STENCILFUNC, D3DCMP_ALWAYS );
13
14 // Set back-facing stencil function to always pass
15 m_pd3dDevice->SetRenderState( D3DRS_CCW_STENCILFUNC, D3DCMP_ALWAYS );
16
17 // Set stencil ref. value to 1 with full mask
18 m_pd3dDevice->SetRenderState( D3DRS_STENCILREF, 0x1 );
19 m_pd3dDevice->SetRenderState( D3DRS_STENCILMASK, 0xffffffff );
20 m_pd3dDevice->SetRenderState( D3DRS_STENCILWRITEMASK, 0xffffffff );
21
22 // Set up stencil operations for depth-fail algorithm
23 // Set up stencil to increment when z-fail occurs for back faces
24 m_pd3dDevice->SetRenderState( D3DRS_CCW_STENCILPASS, D3DSTENCILOP_KEEP );
25 m_pd3dDevice->SetRenderState( D3DRS_CCW_STENCILFAIL, D3DSTENCILOP_KEEP );
26 m_pd3dDevice->SetRenderState( D3DRS_CCW_STENCILZFAIL, D3DSTENCILOP_INCR );
27
28 // Set up stencil to decrement when z-fail occurs for front faces
29 m_pd3dDevice->SetRenderState( D3DRS_STENCILPASS, D3DSTENCILOP_KEEP );
30 m_pd3dDevice->SetRenderState( D3DRS_STENCILFAIL, D3DSTENCILOP_KEEP );
31 m_pd3dDevice->SetRenderState( D3DRS_STENCILZFAIL, D3DSTENCILOP_DECR );
The code above shows all the render state setup needed to make use of two-sided stenciling operations for the depth-fail algorithm. Most of the setup code is similar to that used in the DepthFailCPU sample, with the exception of a few that involve the new render states. The HLSL samples implement the same set of render states within the effects file HLSL_ShadowVolume.fx using the effects file syntax. Note that line 5 sets the culling mode to none, since we would want to render both front- and back-facing triangles at the same time in a single pass. In fact, the ability to draw with no culling is a requirement for graphics drivers before two-sided stenciling support is possible. Lines 8 and 9 separately enable stenciling operations and two-sided stenciling mode. Note that two-sided
stenciling is disabled by default for compatibility with DirectX 8 behavior. As before, the stenciling function is set to always pass at line 12. However, since we are now working in two-sided stenciling mode, the render state in line 12 only affects the rendering of front-facing triangles. The new render state D3DRS_CCW_STENCILFUNC has to be set in line 15 to force the stenciling function for back-facing triangles to always pass as well. Lines 18 through 20 set up the stencil reference value and the masks, which affect both the front- and back-facing geometries. Finally, the code at lines 24 through 31 sets up the stenciling operations for both front-facing and back-facing triangles, according to the requirements of the depth-fail algorithm. Be aware again that two different sets of render states are needed, one for front-facing and one for back-facing geometries. Once the stencil operations have been set up, as shown in the above code, we can start rendering the shadow volume geometries, and the stencil buffer will be filled with the correct stencil values needed for drawing depth-fail shadows. At the time of publication, only the Radeon 9700/9800 and GeForceFX consumer graphics cards fully support DirectX 9. Therefore, the reader should note that two-sided stenciling is not a standard capability of most graphics hardware (even for that once-pricey GeForce4 Ti 4600 card). As such, implementations utilizing two-sided stenciling should always be backed up with hardware capability checks during program startup. A new capability bit, D3DSTENCILCAPS_TWOSIDED, was introduced in DirectX 9 for detecting devices that support this new stenciling mode. With this in mind, the HLSL samples implement both the old one-sided stencil mode and the new two-sided stencil mode, and on-the-fly switching between the two modes is possible.
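The claim that one two-sided pass fills the stencil buffer exactly as two one-sided passes do can be checked with a tiny per-pixel simulation. The sketch below does not touch Direct3D at all: VolumeFragment, stencilTwoPass, and stencilTwoSided are hypothetical names, the z-test is assumed to be less-or-equal, and an 8-bit stencil value with wrapping increment and decrement (the behavior of D3DSTENCILOP_INCR and D3DSTENCILOP_DECR) is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One shadow volume fragment covering the pixel under inspection.
struct VolumeFragment { float depth; bool frontFacing; };

// With a less-or-equal z-test, a fragment fails when it lies behind the scene.
static bool zTestFails(float fragmentDepth, float sceneDepth) {
    return fragmentDepth > sceneDepth;
}

// Depth-fail with two passes: back faces first (increment), then front faces (decrement).
std::uint8_t stencilTwoPass(const std::vector<VolumeFragment>& frags, float sceneDepth) {
    assert(sceneDepth >= 0.0f && sceneDepth <= 1.0f);
    std::uint8_t stencil = 0;
    for (const VolumeFragment& f : frags)
        if (!f.frontFacing && zTestFails(f.depth, sceneDepth)) ++stencil;
    for (const VolumeFragment& f : frags)
        if (f.frontFacing && zTestFails(f.depth, sceneDepth)) --stencil;
    return stencil;
}

// Depth-fail with two-sided stenciling: one pass in submission order,
// choosing the operation per fragment from its facing.
std::uint8_t stencilTwoSided(const std::vector<VolumeFragment>& frags, float sceneDepth) {
    assert(sceneDepth >= 0.0f && sceneDepth <= 1.0f);
    std::uint8_t stencil = 0;
    for (const VolumeFragment& f : frags) {
        if (!zTestFails(f.depth, sceneDepth)) continue;
        if (f.frontFacing) --stencil; else ++stencil;  // wraps modulo 256
    }
    return stencil;
}
```

Because the wrapping operations are plain modular arithmetic, the order in which front and back faces arrive does not change the final count. The saturating variants do not share this property, which is one reason the wrapping operations are the safer choice in single-pass mode.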
duplicated vertices representing exactly the same point. To see an example of an unwelded mesh, open the mesh viewer tool that is part of the DirectX utilities and create a cube. Look at the vertex information of the cube, and you can see that there are 24 vertices instead of just eight. This is really unavoidable, since Direct3D's version of a vertex structure contains color and normal information that cannot be shared by different faces referring to the same point due to differing lighting properties. Hence, extra vertices are generated for different faces with different color and lighting properties. The extra vertices are redundant as far as shadow volumes are concerned, but cannot be removed during the silhouette calculation without a considerable amount of comparison work. It is, therefore, wiser to use welded meshes for silhouette determination. The Direct3D mesh viewer utility provides a nifty option to do just that. Click MeshOps, then click Weld Vertices, and check Remove Back To Back Triangles, Regenerate Adjacency, and Weld All Vertices before welding. Alternatively, we can also make use of the mesh function D3DXWeldVertices to weld the mesh ourselves during data initialization. Alex Vlachos and Drew Card [20] also described a method to process complex source data sets into simpler, non-overlapping shadow volume geometries for static light sources. The method described involves computing a list of all the light-facing polygons, which is the brute-force way that we have been doing it in the "Implementation on CPU" and "Implementation on GPU (Shaders)" sections. Next, the list is sorted in a back-to-front order. Going through the list of polygons, a small frustum is created for each face by using the light position and the edges of the face. The face itself is used as the fourth clip plane. This frustum is used to test for obscuring polygons, which are discarded. Doing so recursively creates an unobstructed front capping that eliminates overlapping polygons.
Collapsing edges and removing excessive vertices could further optimize the front capping. This is indeed a good way to speed up shadow volume implementations for static light sources. Our shadow volumes implementation should have a flexible computation path that changes according to different
situational requirements. Instead of the generic brute-force method discussed in the samples, there are many other specific derivative methods that speed up our shadow volume implementation for different situations. This active selection of different methods is part of scene management in general, which we discuss shortly.
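For readers who want to see what welding amounts to for shadow volume purposes, here is a minimal sketch that merges vertices sharing exactly the same position and rewrites the index buffer. It is not D3DXWeldVertices and does not reproduce its behavior: Position, WeldedMesh, and weldByPosition are hypothetical names, and exact floating-point comparison is assumed, whereas a production welder would normally compare within a small tolerance.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

struct Position { float x, y, z; };

struct WeldedMesh {
    std::vector<Position> positions;    // one entry per unique position
    std::vector<std::size_t> indices;   // same triangles, remapped
};

// Merges vertices with identical positions. Color and normal data are ignored
// on purpose, since only positions matter for silhouette determination.
WeldedMesh weldByPosition(const std::vector<Position>& vertices,
                          const std::vector<std::size_t>& indices) {
    std::map<std::tuple<float, float, float>, std::size_t> firstSeen;
    std::vector<std::size_t> remap(vertices.size());
    WeldedMesh out;

    for (std::size_t i = 0; i < vertices.size(); ++i) {
        const Position& p = vertices[i];
        auto key = std::make_tuple(p.x, p.y, p.z);
        auto found = firstSeen.find(key);
        if (found == firstSeen.end()) {
            found = firstSeen.emplace(key, out.positions.size()).first;
            out.positions.push_back(p);
        }
        remap[i] = found->second;
    }

    for (std::size_t index : indices) {
        assert(index < vertices.size());  // index buffer must reference valid vertices
        out.indices.push_back(remap[index]);
    }
    return out;
}
```

Applied to the 24-vertex cube mentioned earlier, a welder of this kind would leave eight positions, and every edge would then be referenced by exactly two triangles through the same pair of indices.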
hinges on the use of open-ended anchored cones for fast hierarchical culling in order to extract the silhouette of the model from any viewpoint. When it comes to optimizations, we should always be wary of optimizing the wrong areas that only have minute contributions to the total overhead. As a rule of thumb, always go for the most frequently called functions or calculation paths. Other ways to improve computations include harnessing special capabilities of the CPU, such as the SSE, SSE2, and 3DNow! technologies provided by Intel and AMD, respectively. SSE (Streaming SIMD Extensions), for example, works on a quad float basis much like shaders. Operations are done in parallel across the four operands, giving a huge boost to any arithmetic-intensive computation.
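As a small illustration of the quad-float idea, the sketch below computes four dot products at once, the kind of operation that dominates CPU-side silhouette determination. It is an assumption-laden example rather than code from the samples: dot4 is a hypothetical name, the data is assumed to be stored component-wise (all x values together, and so on), and a scalar fallback is compiled when SSE is not available.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SHADOW_HAVE_SSE 1
#else
#define SHADOW_HAVE_SSE 0
#endif

// Computes out[i] = ax[i]*bx[i] + ay[i]*by[i] + az[i]*bz[i] for i = 0..3.
void dot4(const float ax[4], const float ay[4], const float az[4],
          const float bx[4], const float by[4], const float bz[4],
          float out[4]) {
    assert(out != nullptr);
#if SHADOW_HAVE_SSE
    // Each intrinsic below operates on four floats in parallel.
    __m128 sum = _mm_mul_ps(_mm_loadu_ps(ax), _mm_loadu_ps(bx));
    sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(ay), _mm_loadu_ps(by)));
    sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(az), _mm_loadu_ps(bz)));
    _mm_storeu_ps(out, sum);
#else
    // Portable fallback for targets without SSE.
    for (int i = 0; i < 4; ++i)
        out[i] = ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i];
#endif
}
```

The component-wise layout is a deliberate choice here: it lets every SSE instruction do useful work on four faces at a time instead of spending instructions on shuffling a single vector.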
Scissor rectangle support is finally available with the introduction of DirectX 9. The DirectX 9 scissor test is implemented by the functions SetScissorRect and GetScissorRect of the IDirect3DDevice9 interface. A new render state, D3DRS_SCISSORTESTENABLE, is also included to toggle the test.
not the viewer! The general rule is that if an occluder cannot see a light source, it cannot cast shadows related to that light source. We have to consider the occluder as a whole because it is nontrivial to handle cases whereby the occluder is partially exposed to the light source. Performing line-of-sight tests on a per-occluder basis can, however, be a big hit on performance, but doing such tests on a per-area basis would probably suffice for most situations. The distance and attenuated strength of a light source are also good gauges of whether it makes a big contribution to the scene makeup. Whenever the distance is beyond a certain predefined limit or when the attenuated strength of the light source is deemed too weak to create distinct shadows in an area, we should have no qualms about dropping it, even if there is a perfect line of sight between the light source and the occluders in the area. The culling of occluders is just as important as the culling of light sources. Once we have selected a list of light sources, we should commence with occluder culling before computing the shadow volumes. For each selected light source, we identify the occluders whose shadow volumes would contribute to any visible shadows within the view frustum. This test can be done easily by using a bounding volume constructed from the light's position and the three opposing sides of the view frustum, as shown in the following figure:
Figure 20: Occluder culling through the light's bounding volume when the light source is outside the view frustum
As shown in Figure 20, only the shadow volumes of the shaded cubes contribute to visible shadows within the view frustum. Any occluders that fall completely outside the bounding volume can be culled away (e.g., non-shaded cubes), since they would not contribute to any visible shadows. In the other case, where the light source is within the view frustum, we should use the view frustum itself as the bounding volume to perform the occluder culling, as shown in Figure 21.
Figure 21: Occluder culling uses the entire view frustum as the bounding volume when the light source is within the frustum.
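The culling test itself can be sketched in a few lines of C++. This is only a minimal illustration, assuming that the bounding volume (either the light's bounding volume or the view frustum) is already available as a set of inward-facing planes and that each occluder has a bounding sphere. The types Vec3, Plane, and Sphere and the function isOccluderCulled are hypothetical names, not part of the samples.

```cpp
#include <vector>

// Hypothetical minimal types for illustration; they are not taken from the samples.
struct Vec3 { float x, y, z; };

// Plane dot(n, p) + d = 0 with a unit-length normal n that points into the volume.
struct Plane { Vec3 n; float d; };

// Bounding sphere of an occluder.
struct Sphere { Vec3 center; float radius; };

inline float signedDistance(const Plane& plane, const Vec3& p) {
    return plane.n.x * p.x + plane.n.y * p.y + plane.n.z * p.z + plane.d;
}

// Returns true when the sphere lies entirely behind at least one plane of the
// convex volume, in which case the occluder can be skipped for this light.
// The test is conservative: a sphere near a corner may be kept even though
// it is actually outside the volume.
inline bool isOccluderCulled(const std::vector<Plane>& volume, const Sphere& s) {
    for (const Plane& plane : volume) {
        if (signedDistance(plane, s.center) < -s.radius) {
            return true;
        }
    }
    return false;
}
```

A conservative test is acceptable here because keeping an extra occluder only costs some performance, whereas wrongly culling one would remove a visible shadow.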
Occluder culling helps minimize the amount of work spent on silhouette computation and shadow volume rasterization for each light source, making each selected light pass leaner and more efficient.
The whole business of selecting light sources and culling occluders boils down to good scene management. An important component of scene management is the added responsibility of level planners and designers to work out an arrangement in which the light settings and positioning in a scene do not break or compromise the underlying shadow volume implementation. Therefore, it is often imperative that level planners and designers have a thorough understanding of the light source selection criteria used by the graphics engine before they set out to build the first scene. Charles Bloom [22] offers some useful notes on the selection of light sources, while Cass Everitt and Mark Kilgard [27] present several optimizations for implementing shadow volumes.
Another aspect of scene management is identifying the relationship between occluders and light sources and possibly embedding this information in the scene hierarchy. Tagging geometry according to its movement behavior and its relationship to a light source is a good way to branch quickly into a faster, more specific shadow volume implementation. For example, let's say that we have a static light source in an oil lamp on a chandelier hanging from the ceiling of a church. The spatial relationship between the light source and the occluder (chandelier) is static because the shadow volume of the occluder never changes in the occluder's own space, even while it is swinging, since the light source swings in perfect synchronization with it. Hence, for the chandelier, which can be a complicated model, we can precompute an optimized front capping that can be reused every frame. Next, a player character walks into the church. The spatial relationship between the light source and the occluder (player model) is dynamic. Hence, for the player model, we should switch back to the more elaborate (slower) shadow volume computation. Proper scene management goes a long way in cutting the cost of shadow volume implementations while retaining all the visual enhancements that come with them.
Next, remember that one of the important requirements of shadow volumes is the need for closed volume meshes. As described before, this is needed because any gaps or holes within a mesh could throw the stencil counting off balance and thus break the shadow volume implementation.
Such a requirement means that modelers and designers must alter their workflow and modeling style in order to avoid compromising the graphics engine. This is often the most daunting task for any program manager to undertake if there is a decision to turn toward stencil shadow volume support. As far as programmers are concerned, shadow volume implementations can be made more robust by adding preprocessing steps that detect unclosed volumes, reduce vertices, and even remove unwanted t-junctions (Lengyel [21]).
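As an illustration of such a preprocessing test, the following C++ sketch checks whether an indexed triangle list forms a closed volume. It assumes consistently wound triangles that share vertex indices along common edges, so every directed edge must be matched by exactly one edge running in the opposite direction. The Triangle type and the isClosedMesh function are hypothetical and are not taken from the samples.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical indexed triangle; v holds three vertex indices.
struct Triangle { std::uint32_t v[3]; };

// Returns true when every directed edge (a, b) occurs exactly once and is
// matched by exactly one opposite edge (b, a). A gap or hole leaves at least
// one edge without a partner; inconsistent winding or a non-manifold edge
// makes some directed edge occur more than once.
inline bool isClosedMesh(const std::vector<Triangle>& triangles) {
    std::map<std::pair<std::uint32_t, std::uint32_t>, int> edgeCount;
    for (const Triangle& t : triangles) {
        for (int i = 0; i < 3; ++i) {
            const std::uint32_t a = t.v[i];
            const std::uint32_t b = t.v[(i + 1) % 3];
            if (a == b) {
                return false;  // degenerate triangle
            }
            ++edgeCount[std::make_pair(a, b)];
        }
    }
    for (const auto& entry : edgeCount) {
        if (entry.second != 1) {
            return false;
        }
        const auto opposite =
            edgeCount.find(std::make_pair(entry.first.second, entry.first.first));
        if (opposite == edgeCount.end() || opposite->second != 1) {
            return false;
        }
    }
    return !triangles.empty();
}
```

Meshes that duplicate vertices along seams (for example, for texture coordinates) would need their positions welded before a test like this one, since it compares indices rather than positions.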
The End
This ends our discussion on stencil shadow volume implementation. I would like to take this opportunity to thank ShaderX2 editor Wolfgang Engel and Andre Chen for reviewing this article. My heartfelt gratitude also goes to Wordware Publishing, Inc. and my company, Silicon Illusions (www.siliconillusions.com), for their support and help. Many thanks also to James Paul Pilande, who provided the models used in all the samples.
A word about the samples: There are four samples built using the common files framework provided by DirectX 8.1 (C++): DepthPassCPU, DepthFailCPU, FiniteGPU, and InfiniteGPU. There are two additional samples, FiniteHLSL and InfiniteHLSL, based on the common files framework, effects files, and the HLSL support provided by DirectX 9.0 (C++). All source data used are standard *.x file meshes re-authored in MilkShape 3D [23] from their original *.3ds format.
Color Plates 6 and 7 provide examples of what can be done with the sample files. Plate 6 shows a scene consisting of dynamic shadow casters and a light source. It showcases the increased realism gained from accurate shadowing using the stencil shadow volume technique. This technique is fast becoming the preferred choice of shadowing in newer 3D games. Plate 7 shows the same scene re-rendered with the extruded shadow volumes exposed. The stencil counting approach used in the technique makes accurate inter-occluder shadowing and self-shadowing possible.
References
[1] Crow, Frank, "Shadow Algorithms for Computer Graphics," Computer Graphics, Vol. 11:3, SIGGRAPH 77, July 1977.
[2] Heidmann, Tim, http://developer.nvidia.com/docs/IO/2585/ATT/RealShadowsRealTime.pdf.
[3] Kilgard, Mark, http://developer.nvidia.com/docs/IO/1348/ATT/stencil.pdf.
[4] Power Render X game engine, http://www.powerrender.com/prx/index.htm.
[5] Carmack, John, http://developer.nvidia.com/docs/IO/2585/ATT/CarmackOnShadowVolumes.txt.
[6] Bilodeau, Bill and Mike Songy, "Real Time Shadows," Creativity 1999, Creative Labs, Inc. Sponsored game developer conferences, Los Angeles, California, and Surrey, England, May 1999.
[7] Kilgard, Mark, http://developer.nvidia.com/docs/IO/1451/ATT/StencilShadows_CEDEC_E.pdf.
[8] Lengyel, Eric, http://www.gamasutra.com/features/20021011/lengyel_01.htm.
[9] Lengyel, Eric, Mathematics for 3D Game Programming and Computer Graphics, Charles River Media, 2002.
[10] Everitt, Cass, and Mark Kilgard, http://developer.nvidia.com/docs/IO/2585/ATT/GDC2002_RobustShadowVolumes.pdf.
[11] Moller, Tomas, and Eric Haines, Real-time Rendering, Second Edition, A K Peters Ltd., 2002, pp. 61-66, http://www.realtimerendering.com.
[12] Microsoft DirectX MSDN, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/programmingguide/programmingguide.asp.
[13] Watt, Alan, 3D Computer Graphics, Second Edition, Addison-Wesley, 1993, pp. 229-237.
[14] Imagire, Takashi, http://if.dynsite.net/t-pot/program/75_shadow2Vol/index.html.
[15] James, Greg, http://developer.nvidia.com/view.asp?IO=vertexshader_shadowvolumes.
[16] Engel, Wolfgang F., Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, Wordware Publishing, Inc., 2002, pp. 51-52, http://www.shaderx.com.
[17] nVidia, NVASM vertex and pixel shader macro assembler, http://developer.nvidia.com/view.asp?IO=nvasm.
[18] Brennan, Chris, "Shadow Volume Extrusion Using a Vertex Shader," Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, pp. 188-194, http://www.shaderx.com.
[19] ATI, Treasure Chest and Island demos, http://www.ati.com/developer/demos/r8000.html.
[20] Vlachos, Alex and Drew Card, "Computing Optimized Shadow Volumes for Complex Data Sets," Game Programming Gems 3, Charles River Media, Inc., 2002, pp. 367-371.
[21] Lengyel, Eric, "T-Junction Elimination and Retriangulation," Game Programming Gems 3, pp. 338-343.
[22] Bloom, Charles, http://www.cbloom.com/3d/techdocs/shadow_issues.txt.
[23] MilkShape 3D modeler, http://www.milkshape3d.com/.
[24] Harvard University, Xianfeng Gu, Steven J. Gortler, Hugues Hoppe, Leonard McMillan, Benedict J. Brown, and Abraham D. Stone, "Silhouette Mapping," Computer Science Technical Report: TR-1-99, http://research.microsoft.com/~hoppe/silmap_tr_text.pdf.
[25] Sander, Pedro V., Xianfeng Gu, Steven J. Gortler, Hugues Hoppe, and John Snyder, "Silhouette Clipping," ACM SIGGRAPH 2000, pp. 327-334, http://people.deas.harvard.edu/~pvs/research/silclip/.
[26] Lengyel, Eric, "Tweaking a Vertex's Projected Depth Value," Game Programming Gems, Charles River Media, Inc., 2000, pp. 361-365.
[27] Everitt, Cass and Mark J. Kilgard, "Optimized Stencil Shadow Volumes," http://developer.nvidia.com/docs/IO/4449/SUPP/GDC2003_ShadowVolumes.pdf.
[28] nVidia, "Understanding the w Coordinate," http://developer.nvidia.com/view.asp?IO=understanding_w.
Introduction
Many of the current challenges facing 3D graphics application developers are centered on creating and using programmable graphics shaders. These programmable graphics shaders are at the heart of all future graphics chips. With the introduction of the Radeon 9000, shaders are now supported on the entry-level PC and will soon trickle down to all other devices. Developers with the ability to create and use these programmable shaders are able to take advantage of all that the hardware offers and create applications that redefine the art of real-time graphics. In order to help developers unlock the creative potential of today's graphics chips and improve the shader prototyping and development process, ATI Technologies has developed the RenderMonkey Integrated Development Environment (IDE).
Although writing assembly or High Level Shading Language code is the heart of the shader development process, shaders are more than just the code. Encapsulating shader-based effects can be a complex task, since it involves capturing the entire state of the system that is involved in rendering these effects. This leads to a common problem among shader developers: exchanging and sharing shaders is not a trivial task.
Another problem that many game developers face when starting to develop shaders is the need to closely involve artists in the process. Without tools that artists are comfortable with, it becomes difficult to collaborate on effect creation. What's needed is an environment where not just the programmers but also the artists and game designers can work together to create mind-blowing special effects using shaders.
RenderMonkey is designed to solve many of these problems and facilitate the shader prototyping process for your game engines. With this tool, we provide a powerful programmer's development environment for creating shaders, which can be used as a standard delivery mechanism to allow sharing of shader-based effects in the developer community. We also provide a flexible, extensible framework that supports easy integration of custom components and provides a solid basis for future tool development. RenderMonkey can be easily customized and integrated into any developer's regular workflow. The design of the RenderMonkey IDE allows easy incorporation of current and future rendering APIs. By the time this book is published, you will be able to download version 1.0 of the program from ATI's web site (http://www.ati.com/developer). That version includes support for DirectX 9 shader effects (using both assembly and HLSL), as well as support for creating OpenGL-based effects using the GL2 High Level Shading Language.
Although this chapter does not focus on the intricacies of writing shader code, there are some excellent chapters on that topic in this book and its companion book, ShaderX2: Shader Programming Tips & Tricks with DirectX 9. For those of you interested in learning the DirectX High Level Shading Language, you should read the "Introduction to the DirectX High Level Shading Language" article by Craig Peeper and Jason Mitchell, which appears in this book.
There are also several articles in ShaderX2: Shader Programming Tips & Tricks with DirectX 9 that I coauthored with my colleagues, which focus on the development of interesting shaders. These articles all use RenderMonkey workspaces that you can load into RenderMonkey and experiment with. Take a look at "Simulation of Iridescence and Translucency on Thin Surfaces" (N. Tatarchuk, C. Brennan), "Motion Blur Using Geometry and Shading Distortion" (N. Tatarchuk, C. Brennan, and J. Isidoro), "Layered Car Paint Shader" (C. Oat, N. Tatarchuk, and J. Isidoro), and "Real-Time Depth of Field Simulation" (G. Riguer, N. Tatarchuk, and J. Isidoro), as well as the "Advanced Image Processing with DirectX 9 Pixel Shaders" article by J. Mitchell, M. Ansari, and E. Hart. You will find a great deal of interesting material on developing spectacular visual effects in these articles.
- A workspace view, which shows the effect workspace being edited
- An output window for compilation results and text messages from the application
- A preview window used to preview effects being edited
- Other editor modules, such as editors for shader code and GUI editors for shader parameters. Shader parameters can be tagged as artist-editable and then edited in a coherent way using the artist editor module.
...where ka, kd, and ks are the coefficients for the ambient, diffuse, and specular light contributions, respectively. These parameters are assigned a constant value in the range of 0 to 1, according to the reflecting properties that we want the surface to exhibit. If we want a highly reflective surface, we set the values for kd and ks to be near 1. This produces a bright surface with the intensity of the reflected light near that of the incident light. To simulate a surface that absorbs most of the incident light, we set the reflectivity to a value near 0. Id is the intensity of the diffuse contribution of the point light source that we are simulating, and Is is the intensity of the specular contribution of that light source. Ia is the ambient light intensity. N, V, and R are the normal, view, and reflection vectors, respectively. ns is the specular-reflection parameter, which controls how quickly the highlight falls off with the angle φ between the view and reflection vectors. Shiny surfaces have a narrow specular range (the angle between these two vectors is smaller), and dull surfaces have a wider reflection range. Thus, a very shiny surface can be modeled with a large value for ns (around 100, for example), and a dull surface can be modeled using an ns value equal to 0.5.
We use equations (2) to (4) in the pixel shader to compute the resulting color for each pixel for this illumination model. But first let's start the application and begin building the workspace for the effect.
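To make the roles of these parameters concrete, here is a small CPU-side sketch in C++. It assumes the standard Phong form, I = ka·Ia + kd·Id·(N·L) + ks·Is·(V·R)^ns, which may differ in detail from the equations used in this chapter; the light vector L, the Vec3 and PhongParams types, and the phongIntensity function are illustrative assumptions rather than RenderMonkey code.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

inline Vec3 normalized(const Vec3& v) {
    const float len = std::sqrt(dot(v, v));
    return Vec3{ v.x / len, v.y / len, v.z / len };
}

struct PhongParams {
    float ka, kd, ks;  // ambient, diffuse, and specular coefficients in [0, 1]
    float Ia, Id, Is;  // ambient, diffuse, and specular light intensities
    float ns;          // specular-reflection parameter (exponent)
};

// N: surface normal, L: direction from the surface toward the light (assumed
// input), V: direction from the surface toward the viewer.
inline float phongIntensity(const PhongParams& p, Vec3 N, Vec3 L, Vec3 V) {
    N = normalized(N);
    L = normalized(L);
    V = normalized(V);
    const float nDotL = dot(N, L);
    if (nDotL <= 0.0f) {
        return p.ka * p.Ia;  // the light is behind the surface: ambient only
    }
    // Reflection of L about N: R = 2 (N.L) N - L
    const Vec3 R = { 2.0f * nDotL * N.x - L.x,
                     2.0f * nDotL * N.y - L.y,
                     2.0f * nDotL * N.z - L.z };
    const float vDotR = std::max(dot(V, R), 0.0f);
    return p.ka * p.Ia + p.kd * p.Id * nDotL + p.ks * p.Is * std::pow(vDotR, p.ns);
}
```

With a large ns, the specular term drops off sharply as V moves away from R, which is why large values produce the small, tight highlight of a shiny surface.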
and it is user-extensible. Best of all, any user can open an XML RenderMonkey file and read the file directly in Internet Explorer; it's just another ASCII file format. To start working on a shader-based effect, we simply launch the application, which automatically starts out with a new empty workspace. All effect-related data is stored in the effect workspace using RenderMonkey's run-time data format. Each effect workspace consists of these elements:
Each effect group is used to encapsulate a series of related effects. For example, you may want to group all effects that use a noise function to render perturbation-based effects, such as clouds or fire or plasma, in one single effect group. Another good use for this node is grouping various implementations of a single effect for fallback rendering in your engine. Each effect group consists of one or more effect nodes. Each effect is used to draw a single, coherent visual effect in the viewer. You may have a single-pass effect, or you may want to use several draw calls to generate the look that you want. Each draw call (or pass, as RenderMonkey refers to it) may consist of the following data:
- A render state block (optional)
- A vertex shader (required)
- A pixel shader (required)
- A geometry model reference (required)
- A stream mapping reference (required)
- One or more texture objects with valid texture references (optional)
- Variable nodes (optional)
All individual items in the RenderMonkey effect workspace are referred to as nodes.
Workspace View
The main window into the effect workspace is the Workspace view window. That's the dockable window, usually positioned on the left of the main interface, containing a tabbed tree control that provides a high-level view of the effect database. Figure 2 shows the Workspace view window:
The workspace view can be used to access all elements of the effect workspace. The intention is that individual effects will be grouped by their common attributes in an effect workspace. There are two tabs in the workspace tree view: the Effect tab and the Art tab. The Effect tab is used to view the entire workspace with all variables and passes visible. The Art tab is used to view only the artist-editable variables that are present in the workspace. Once an effect is developed by the programmer using the Effect tab, it can be handed over to the artists, who can view just the artist-editable data by selecting the Art tab.
Let's start working on our effect. If you right-click on the workspace node, you can select the Add Effect Group menu option from the context menu that appears. The context menu is shown in Figure 3:
When you add a new effect group to the workspace, RenderMonkey automatically populates the workspace with several nodes. It automatically adds a sample effect with one pass. The pass inside that effect contains sample vertex and pixel shaders and a sample geometry model. If you have an ATI Radeon 8500 or better hardware, you can see a red teapot in the preview window. If not, you need to change the target for the pixel shader to ps_1_1 (I go over how to do that later in this article). RenderMonkey also adds a matrix variable called view_proj_matrix for storing the view projection matrix and a standard stream mapping node called standard mapping. A sample model node is added as well. This enables you to start right away with a fully functioning effect that you can build upon to create something more visually appealing than a red teapot.
Since we already know that we need several variables as input to our shaders, let's add them to our new workspace. To add a new variable, right-click on the node you want to add that variable to and select Add Variable from the context menu that appears (see Figure 4).
You can select one of the RenderMonkey-supported data types for your variable nodes:
- Scalar (a simple float variable)
- Vector (4D float variable)
- Matrix (4x4 float matrix)
- Color (4D float variable, RGBA color representation)
- Texture variables: texture, cube map, and volume texture
The icons on the left of each node in the Workspace view help you quickly identify the node type; vectors, scalars, colors, matrices, and so on each have their own icon. By default, new scalar, vector, and matrix variables are created as not artist-editable. Color, texture, cube map, and volume texture variables are created as artist-editable. You can make any new variable artist-editable by checking the Artist Editable check box in the Add Variable dialog. This is necessary to make a variable visible in the artist editor or on the Art tab in the Workspace view. If you wish to make a variable artist-editable at any later point, right-click on that variable and select the Artist Variable menu option. To remove the artist-editable property from a variable, right-click on the variable and select the Artist Variable menu option again. A check mark on that option indicates whether the variable is artist-editable. A small yellow flag on the variable icon also indicates that the variable is artist-editable.
- view_proj_matrix: A variable of type matrix, which contains the view projection matrix
- view_matrix: A variable of type matrix, which contains the view matrix
- inv_view_matrix: A variable of type matrix, which contains the inverse of the view projection matrix
- proj_matrix: A variable of type matrix, which contains the projection matrix
- time: A variable of type vector, which provides the current time value, wrapped over a cycle length that can be modified in the RenderMonkey Preferences dialog. By default, the cycle is set to 120.
- cos_time: A variable of type vector, which provides the cosine of time
- sin_time: A variable of type vector, which provides the sine of time
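The relationship between the three time variables can be pictured with the following C++ sketch. It is only an illustration: the wrap-around is assumed to behave like a floating-point modulo, the layout of the actual four-component vectors is not shown, and TimeValues and cycledTime are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar view of the time, cos_time, and sin_time values.
struct TimeValues {
    float time;
    float cosTime;
    float sinTime;
};

// Wraps the elapsed time into [0, cycle) and derives the cosine and sine.
// The default cycle of 120 mirrors the default mentioned in the text; the
// modulo wrap itself is an assumption about how the cycling works.
inline TimeValues cycledTime(float elapsed, float cycle = 120.0f) {
    assert(cycle > 0.0f);
    const float t = std::fmod(elapsed, cycle);
    return TimeValues{ t, std::cos(t), std::sin(t) };
}
```

A shader animation driven by time therefore repeats once per cycle, which is worth keeping in mind when choosing animation speeds.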
The easiest way to add predefined variables to your workspace is to select the appropriate type of predefined variable that you would like to use and then choose the name from the combo box that appears in the Name area of the Add Variable dialog (see Figure 6). Note that the combo box only appears if the selected type has some predefined variables. If the user then chooses another type for a given predefined variable name, it is not appropriately initialized at run time, as RenderMonkey identifies predefined variables by both name and type.
Predefined variables are easy to identify in any RenderMonkey workspace, as they have a small green overlay on top of their usual variable type icon.
Once a stream mapping node is created, you can edit its contents by double-clicking on the node or right-clicking on the stream mapping node and selecting Edit, which brings up the stream mapping editor module shown in the following figure.
We already know that to compute correct illumination results, we need the vertex normals as well as the vertex positions as inputs to the vertex shader. Let's add that channel to the stream defined in our workspace. Double-click on the standard mapping node to bring up the stream editor. To add new channels to the stream setup, click on the Add Channel button in the stream mapping editor. Then select the desired input register and specify the usage for that stream, the usage index, and the data type. If you want to delete a specific channel, click on the X button to the right of the channel. In Figure 9 below, I have added a second channel to bind the normals for vertices. Don't forget to set the data type for the normals channel to FLOAT3.
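One way to think about a stream mapping is as a small table of channels, each with a usage, a usage index, and a data type. The C++ sketch below models that table and computes the resulting vertex size. The StreamChannel and ChannelType types and both functions are hypothetical, and the FLOAT3 type for the position channel is an assumption; only the FLOAT3 normals channel comes from the text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical data types for a stream channel.
enum class ChannelType { Float1, Float2, Float3, Float4 };

// Hypothetical description of one channel in a stream mapping.
struct StreamChannel {
    std::string usage;   // for example "POSITION" or "NORMAL"
    int usageIndex;      // distinguishes several channels with the same usage
    ChannelType type;
};

inline std::size_t sizeInBytes(ChannelType type) {
    switch (type) {
        case ChannelType::Float1: return 1 * sizeof(float);
        case ChannelType::Float2: return 2 * sizeof(float);
        case ChannelType::Float3: return 3 * sizeof(float);
        case ChannelType::Float4: return 4 * sizeof(float);
    }
    return 0;
}

// Size of one vertex when all channels are packed back to back.
inline std::size_t vertexStride(const std::vector<StreamChannel>& channels) {
    std::size_t stride = 0;
    for (const StreamChannel& channel : channels) {
        stride += sizeInBytes(channel.type);
    }
    return stride;
}

// Position channel plus the normals channel added in this section.
inline std::vector<StreamChannel> mappingWithNormals() {
    return {
        StreamChannel{"POSITION", 0, ChannelType::Float3},  // assumed type
        StreamChannel{"NORMAL",   0, ChannelType::Float3},
    };
}
```

If the type of the normals channel does not match the model data, the shader reads misaligned values, which is why the FLOAT3 setting matters.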
To actually use the stream mapping for a specific draw call, you need to add a stream mapping reference to the pass in which you would like to use it. To do that, first make sure that you've created a stream mapping node (like standard mapping) somewhere in the workspace tree. Then select the pass to which you want to add the stream mapping reference (Pass 1 in our case) and right-click on that node. Select Add Stream Mapping Reference from the context menu (as you can see in Figure 10):
An empty stream mapping reference is then created. That reference is initially not linked to any stream mapping node. A red line on the stream mapping reference icon shows you that the reference isn't correctly resolved. To link a reference to a stream mapping node, right-click on the stream mapping reference node and select the Reference Node menu, from which you can select the name of the actual stream mapping node that you would like to reference in that pass (as shown in Figure 11). You can also double-click on the stream mapping reference node and rename it to the name of the stream mapping node to link it directly.
To resolve the stream mapping for a particular pass, RenderMonkey first checks the pass itself for a stream mapping node or reference. If neither is found, the application walks up the workspace tree to find the first stream mapping node or reference. Note that stream mapping nodes and references should be placed with care, since incorrect use of stream mapping nodes produces incorrect rendering results. If the stream mapping node name is found and resolved correctly, the stream mapping reference node changes to its resolved icon. Note the small arrow in the icon, which denotes that the node is a reference rather than the actual stream mapping node. That convention holds for all reference nodes in RenderMonkey, so you can easily spot references in the workspace.
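The lookup described above amounts to a walk from the pass toward the root of the workspace tree. The following C++ sketch illustrates the idea under simplifying assumptions; the Node type, the NodeKind enumeration, and the functions are hypothetical and do not reflect RenderMonkey's internal data structures.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical node kinds in a workspace tree.
enum class NodeKind { Workspace, EffectGroup, Effect, Pass,
                      StreamMapping, StreamMappingRef, Other };

// Hypothetical workspace tree node.
struct Node {
    NodeKind kind;
    std::string name;
    Node* parent;
    std::vector<Node*> children;

    Node(NodeKind k, std::string n)
        : kind(k), name(std::move(n)), parent(nullptr) {}
};

inline void addChild(Node& parent, Node& child) {
    child.parent = &parent;
    parent.children.push_back(&child);
}

// Starts at the given pass and moves toward the root, returning the first
// stream mapping node or stream mapping reference found among the children
// of each scope, or nullptr when nothing is in scope.
inline const Node* findStreamMapping(const Node* pass) {
    for (const Node* scope = pass; scope != nullptr; scope = scope->parent) {
        for (const Node* child : scope->children) {
            if (child->kind == NodeKind::StreamMapping ||
                child->kind == NodeKind::StreamMappingRef) {
                return child;
            }
        }
    }
    return nullptr;
}
```

Because the nearest match wins, a stream mapping placed inside a pass hides one placed higher in the tree, which is the kind of placement mistake the note above warns about.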
Model Management
An important aspect of every visual effect is the actual geometry that gets rendered on the screen. RenderMonkey uses model and model reference nodes to allow you, the user, to specify which geometry to render in each draw call. As you can already see by this point, the workspace contains a model node under the main workspace node and a model reference node under the Pass 1 node. You can easily spot the model nodes by their red teapot icon. The model reference nodes follow the convention described above for references and have a small arrow next to the icon. To load a new geometry model into a model node, double-click on that node and select a file containing your geometry object from the list of supported file formats shown in the file open dialog. To bind the data from streams to the shaders, RenderMonkey pairs a stream mapping with a model: adding both references to each pass ensures that the necessary data is present at run time, and RenderMonkey then binds it to the stream sources.
Managing Effects
Although we won't need to add any extra effects at this time, let's talk briefly about managing effects in RenderMonkey. As mentioned earlier, each effect in the workspace is used to draw a single, coherent visual effect in the viewer. It can consist of one or more draw calls. To create a new effect, right-click on the effect group to which you want to add the new effect. Select Add Effect from the context menu that appears (see Figure 12) to create a new effect at the bottom of this group:
Figure 12: Adding new effects to the workspace from the context menu
You can change the effect name at any point by simply renaming it. By default, when RenderMonkey adds a new effect, it adds a single pass with HLSL vertex and pixel shaders in it. The main thing you want to do with effects is view them. To do that, set the effect that you wish to render as the active effect, which is the effect rendered by the viewer module: right-click on the desired effect and select the Set as Active Effect menu option. You can easily check which effect is active in the workspace because its name appears in bold typeface.
At this point you need to select what type of shader you want to add to that effect. You have a choice of adding an assembly or HLSL shader to the pass. Figure 14 shows the dialog box that appears for that purpose:
Clicking OK adds a new shader to the selected effect. You can easily spot what type of shaders the effect has, since DirectX assembly shaders and DirectX HLSL shaders use different icons for their vertex and pixel shaders. RenderMonkey automatically chooses the shader editor for each shader, depending on its version. Note that you can only have one vertex shader and one pixel shader in an individual pass. If you wish to change shader types (for example, replace an assembly shader with an HLSL shader), you need to first delete the old shader and add a new one in its place.
Editing Shaders
Since we already have a pass with a pair of shaders, let's start working on the actual shader code at this point. To edit a shader, double-click on its shader node. RenderMonkey will open the shader editor for your shader. There is a single shader editor window for all the passes in a single effect. Figure 15 shows the shader editor user interface containing the HLSL vertex shader.
As you can see from the UI above, the shader editor has two tabs for a vertex and a pixel shader for each pass. The UI for the actual shader editing is selected according to the shader type; see Figure 16 for a snapshot of the assembly shader editor UI.
To edit shaders in a different pass, you simply need to select the pass from the top-left combo box in the main Shader Editor window. The tabs for vertex and pixel shaders will be updated to show the shaders in the new pass. You can use Ctrl+Tab to quickly switch between the vertex and pixel shader tabs.
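// Vertex shader output structure. The member names are taken from main();
// the TEXCOORD semantics and the member order are assumed.
struct VS_OUTPUT
{
   float4 Pos   : POSITION;
   float3 Light : TEXCOORD0;
   float3 Norm  : TEXCOORD1;
   float3 View  : TEXCOORD2;
};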
VS_OUTPUT main( float4 inPos : POSITION, float3 inNorm : NORMAL )
{
   VS_OUTPUT Out = (VS_OUTPUT) 0;

   // Output transformed position:
   Out.Pos = mul( view_proj_matrix, inPos );

   // Output light vector:
   Out.Light = -lightDir;

   // Compute position in view space:
   float3 Pview = mul( view_matrix, inPos );

   // Transform the input normal to view space:
   Out.Norm = normalize( mul( view_matrix, inNorm ) );

   // Compute the view direction in view space:
   Out.View = - normalize( Pview );

   return Out;
}
This vertex shader transforms the vertex position and outputs it. It then computes the light vector using a shader parameter named lightDir (we will add all shader constants after we're done creating our shaders). It also computes the vertex position in view space using another RenderMonkey predefined variable, view_matrix, then computes the view vector and the normal vector in view space and outputs those to the pixel shader.
Before we add this code to the shader itself, let's go over the user interface for editing HLSL shaders in RenderMonkey. The High Level Shading Language (HLSL) editor consists of three sections. The UI widgets at the top of the editor are used to manage shader parameters for HLSL shaders. The text editor control in the middle portion of the editor is used to view the declaration block of an HLSL shader, which contains the parameter declarations. This pane is not editable; the declaration block is controlled solely through the UI widgets in the top portion of the editor. This is necessary to ensure that the RenderMonkey variable nodes and texture objects get properly mapped to HLSL parameters. The bottom pane is the editor widget used to edit the actual shader text (take a look at Figure 15 again). Note that once you're done mapping your constants and samplers, you can minimize the Constant Editor block by selecting the check box on top of it.
To map a RenderMonkey variable node (a vector, a color, a matrix, or a scalar node), left-click on the arrow button next to the variable's Name label. This action opens a pop-up menu containing a list of all variable nodes within the scope of the shader being edited. You can then select a variable node from that pop-up menu:
At that point, the label under the Name column changes to the name of the node that you selected. Next, click on the Add button to add that variable node to the declaration block and map it internally as a shader constant. You will then see the actual text declaring that variable appear in the declaration block of the shader.
Let's add the light direction vector to our workspace and map it to a constant in the vertex shader that we are writing. Right-click on the effect workspace node and select Add Variable. Then select Vector as the variable type and type lightDir in the name field. You can leave the Artist Editable check box empty if you wish. Clicking OK adds a new light direction vector node to the workspace tree. Go to the vertex shader editor that we already opened, and follow the steps above for mapping the light direction vector to a constant in the shader editor. After you click Add, you will see the text float4 lightDir; appear in the shader's declaration block. We've just added our first constant to the shader!
The next parameter that this shader uses is a view matrix. Since it's a predefined RenderMonkey variable, you won't be able to modify its values explicitly. Let's add it to the workspace first. Right-click on the effect workspace node and select Add Variable. Select Matrix as the variable type. You will see that the Name edit field changes to a combo box. Expand that combo box and select view_matrix from the list of variables that appears. Clicking OK adds the predefined view matrix to your workspace, and its node appears in the workspace tree. The little green p icon at the bottom-left corner always lets you know that it is a RenderMonkey predefined variable. Follow the same steps described above to add it to the vertex shader declaration block; you will now see the full declaration block (though not necessarily in this order):
float4x4 view_proj_matrix: register(c0);
float4 lightDir;
float4x4 view_matrix;
Now you can simply type the rest of the shader code (the actual main function and vertex shader output structure declaration) into the shader text editor window. Readers should note that for High Level Shading Language parameter definitions, the RenderMonkey nodes they desire to map must be named within the constraints of the High Level Shading Language; otherwise, improper naming will result in compilation errors. Please refer to the HLSL language manual for more information on naming conventions. By default, an HLSL shader entry point is set to main, which is actually what we want for both shaders. If you wish to change the entry point for your shader, you can do that by typing a different name in the entry point edit field: . Since every HLSL shader must provide a compilation target, we need to specify that as well. By default, RenderMonkeys HLSL added shaders have vs_1_1 and ps_1_4 shader targets. To change the version of the shader to which you are compiling, you should select from a list of available targets from the Target combo box: . The target sets are separate for pixel and vertex shaders please refer to High Level Shading Language documentation for an explanation of each target value. The bottom pane is used to enter the actual text of the shader. The shader text must contain at least one function with the same name as the specified entry point for the shader to compile. The shader text editor has High Level Shading Language customizable syntax coloring.
Output Window
The Output module is a docked window typically located on the bottom of the main application interface (see Figure 18). That window is used to output the results of shader compilation and other application text messages. The Output window is linked with the shader editor for compilation error highlighting.
shader and highlight the line containing an error (see an example in Figure 19). If you modified the shader text and then closed the editor without committing the changes, RenderMonkey will ask whether you would like to commit the changes first.
Editing Assembly
Although we do not edit assembly shaders in this particular example, this section describes how to edit them. The assembly Shader Editor window consists of two panes; the top pane is used to bind RenderMonkey variable nodes to shader constant registers, and the bottom pane is used to edit the shader text. You can see a snapshot of the assembly Shader Editor window in Figure 20.
The constant store editor is a list view with three columns. Each row represents values for one particular register. The first column (Constant) can be used to specify the index of the register for that constant. The second column (Name) shows the name of the node that is linked to that register (or a placeholder if there isn't a variable linked to that register). The third column shows the initial value of the variable node linked to the register. Binding a RenderMonkey variable node to a constant store register means that the software binds the internal values of the node directly to the register values. Within the RenderMonkey IDE, vector and color nodes are represented by four different floats, scalars are mapped to four floats having the same value, and matrices are represented by 16 floats.
To bind a RenderMonkey node to a register, you should right-click on the field in the Name column for the constant and select a variable node from the pop-up menu (see Figure 21). The pop-up menu contains all variables that are within the scope of the shader being edited. Once a node is selected, its name will appear in the Name column for the selected register, and the current values of the node will be displayed in the Initial Value column.
To clear a constant store register, you can select the Clear menu option from the pop-up menu for the register. The name of the variable previously linked to that register is replaced by an empty placeholder, and the Initial Value column is cleared. Please note that if you bind a matrix to a particular constant, the three constants below that constant are overwritten with the rows of that matrix. The source editor has support for customizable syntax coloring for pixel and vertex shader assembly code. There is also full clipboard support for standard editing operations.
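The register layout described above can be modeled on the CPU. The following Python sketch is only an illustration of that layout, not RenderMonkey code: the function names and the dictionary used as a constant store are assumptions made for this example.

```python
# Hypothetical CPU-side model of the constant store; not RenderMonkey code.

def pack_scalar(value):
    """A scalar is replicated into all four components of one register."""
    return [[float(value)] * 4]

def pack_vector(x, y, z, w):
    """A vector or color node fills one register with four floats."""
    return [[float(x), float(y), float(z), float(w)]]

def pack_matrix(rows):
    """A 4x4 matrix (16 floats) fills four consecutive registers, one row each."""
    assert len(rows) == 4 and all(len(row) == 4 for row in rows)
    return [[float(c) for c in row] for row in rows]

def bind(store, index, registers):
    """Write the packed registers into the store, starting at register `index`."""
    for offset, register in enumerate(registers):
        store[index + offset] = register
    return store

store = {}
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
bind(store, 0, pack_matrix(identity))              # occupies c0 through c3
bind(store, 4, pack_vector(0.0, 0.0, -1.0, 0.0))   # occupies c4
bind(store, 5, pack_scalar(0.5))                   # occupies c5
```

Binding the matrix at register 0 also writes registers 1 through 3, which is why the next free register for the vector in this sketch is 4.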
• ka, kd, ks: The coefficients for ambient, diffuse, and specular light contributions, respectively
• Ia: The ambient light intensity
• Id: The intensity of the diffuse contribution of the point light source that we are simulating
• Is: The intensity of the specular contribution of the point light source that we are simulating
• ns: The specular-reflection parameter, proportional to the angle φ between the view and reflection vectors
All of the parameters above need to be added as constants to the pixel shader, where we will be directly computing the result of the illumination equation. ka, kd, ks, and ns can be added as scalar variables to the workspace, and Ia, Id, and Is can be added as colors. You should add variable nodes with the following names and types to the main effect workspace node:
• ka: Scalar variable named Ka
• kd: Scalar variable named Kd
• ks: Scalar variable named Ks
• ns: Scalar variable named Ns
• Ia: Color variable named Ia
• Id: Color variable named Id
• Is: Color variable named Is
Here's a snapshot of the workspace tree view that you will have once you've completed this operation:
Figure 22: Workspace with all parameters for Phong specular illumination model
Let's add these parameters to the pixel shader's declaration block. Go through each node that we just added to the workspace (Ka, Kd, Ks, Ns, Ia, Id, and Is) and add them to the pixel shader declaration using the steps described in the vertex shader editing section. Once you've finished adding the last node, you should see the following pixel shader declaration block appear:
float Ka;
float Kd;
float Ks;
float Ns;
float4 Ia;
float4 Id;
float4 Is;
At this point we are ready to start writing the code for our pixel shader. This is where we can really appreciate the simplicity and elegance of writing shaders using a High Level Shading Language (the Microsoft DirectX 9.0 HLSL in our example). If you have ever tried to write assembly shaders, you can certainly appreciate the difference. The code for the complete pixel shader (without the previous declaration block) follows:
float4 main( float4 Diff   : COLOR0,
             float3 Normal : TEXCOORD0,
             float3 View   : TEXCOORD1,
             float3 Light  : TEXCOORD2 ) : COLOR
{
   // Compute the reflection vector:
   float3 vReflect = normalize( 2 * dot( Normal, Light) * Normal - Light );

   // Compute ambient term:
   float4 AmbientColor = Ia * Ka;

   // Compute diffuse term:
   float4 DiffuseColor = Id * Kd * max( 0, dot( Normal, Light ));

   // Compute specular term:
   float4 SpecularColor = Is * Ks * pow( max( 0, dot(vReflect, View)), Ns );

   float4 FinalColor = AmbientColor + DiffuseColor + SpecularColor;
   return FinalColor;
}
You can simply type that code into the pixel shader text editor and hit Commit Changes. Remember to set the target field for this pixel shader to ps_2_0 since we are using the pow instruction.
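To check the arithmetic of this pixel shader away from the GPU, the same terms can be evaluated on the CPU. The Python sketch below mirrors the shader's variable names; the helper functions and the numeric values for Ka, Kd, Ks, Ns, and the colors are assumptions chosen for illustration, not values from this chapter.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def phong(normal, view, light, Ka, Kd, Ks, Ns, Ia, Id, Is):
    """CPU mirror of the pixel shader: ambient + diffuse + specular."""
    n_dot_l = dot(normal, light)
    # Reflection vector: 2 * (N . L) * N - L, normalized.
    reflect = normalize([2 * n_dot_l * n - l for n, l in zip(normal, light)])
    diffuse = max(0.0, n_dot_l)
    specular = max(0.0, dot(reflect, view)) ** Ns
    return [Ia[i] * Ka + Id[i] * Kd * diffuse + Is[i] * Ks * specular
            for i in range(4)]

# Illustrative (assumed) parameter values.
green = [0.0, 1.0, 0.0, 1.0]
head_on = phong([0, 0, 1], [0, 0, 1], [0, 0, 1],
                Ka=0.3, Kd=0.5, Ks=0.2, Ns=8, Ia=green, Id=green, Is=green)
grazing = phong([0, 0, 1], [0, 0, 1], [1, 0, 0],
                Ka=0.3, Kd=0.5, Ks=0.2, Ns=8, Ia=green, Id=green, Is=green)
```

With the light and view vectors aligned with the normal, all three terms contribute fully. With the light at a right angle to the normal, only the ambient term remains.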
Preview Window
At this point we are done editing our shaders. But to actually see the effect of the code, we need to see the results in some sort of a viewer. In RenderMonkey, the preview window is used to interactively preview effects. All changes to a shader or its parameters update the rendered image in real time, thus truly enabling interactive shader development. Figure 23 shows the DirectX 9.0 preview window for an ocean water effect. Simple trackball navigation is provided in the standard RenderMonkey preview module:
• To move the camera forward and backward, use the Z and X keys.
• To rotate the scene, use the mouse.
Note that the model is rotated about the z-axis in the preview window. The output of each render pass can be displayed in an arrayed viewport by the use of the P key (as shown in Figure 24 below):
You can also select from a set of predefined views for your model. To access that, right-click in the preview window and select a view from the list that appears in that menu (Front/Back/Top/Bottom/Left/Right). You can also modify the properties of the standard preview module by selecting Properties from the right-click menu in the preview window. That action brings up a dialog that allows you to:
• Modify the clear color of the preview window
• Modify the clear color used for the pass array
• Modify the field of view
• Modify the near and far clip plane values
For the currently selected effect, the preview module has the ability to display each pass within a multipass effect in arrayed viewports.
Editing Variables
At this point, the preview window shows a white teapot in constant color. The reason for this look lies in the values for your variables. We need to set meaningful values for all of the parameters to our shaders. But before that, let's talk about how to edit variable nodes in RenderMonkey. To edit a variable, you can either double-click on the variable node or select Edit from the right-click menu for that node. That action will bring up an automatically selected editor for that node type.
Scalar Variables
Each scalar can be edited via the scalar editor module shown in the following figure. Note that you can modify the values in any way, but if you aren't happy with them at the end, you can simply click Cancel and the value set prior to opening the scalar editor will be restored. Note that at any point, the preview window will interactively show the changes.
The scalar can be edited by either directly typing the value in the main edit box or interactively using a pop-up slider, which is in the same range as the clamping bounds (regardless of whether or not the user chooses to clamp the scalar). Let's set the values for the scalars in our workspace to the following values:
Note that right after you do that, you see a white teapot in the preview window. We've turned on our illumination!
Vector Variables
Each vector can be edited via the vector editor module:
Each vector component can be edited by either directly typing the value in the component edit box or interactively using a pop-up slider for each component. The sliders' ranges will be the same as the clamping bounds for the vector (regardless of whether or not the user chooses to clamp the vector). The user may also select to keep the vector normalized by selecting the Keep (x, y, z) components normalized check box. You can revert your changes in the same way as you could in the scalar editor by pressing the Cancel button. Let's set the values for the light direction vector for our vertex shader. Double-click on the lightDir variable and enter the following values as its components:
Matrix Variables
Although we aren't going to modify any matrix variables in this example, to edit a matrix variable you can use the matrix editor module shown below:
Each matrix component can be edited by either directly typing the value in the component edit box or interactively using a pop-up slider for each component. The slider range is preset to [-100.0, 100.0]; however, typing a value outside of that range expands the range to that value. The user can also set the matrix to an identity matrix by clicking the appropriately named button. You can revert your changes in the same way as you could in the scalar editor by pressing the Cancel button.
Color Variables
Each color variable can be edited via the color picker module:
The user can edit color using either RGB or HSV mode by directly typing the values in the appropriate edit boxes for each component (R, G, B, A or H, S, V, A), interactively selecting color from the color wheel or color sliders for each component, or modifying the intensity of the color being edited by using the vertical intensity slider. The value of the color is shown in the color swatch at the top-left corner of the color picker. You can also choose to edit color values directly in floating-point format by checking the Floating Point check box and typing the values in the range [-1.0, 1.0] directly into the Red, Green, Blue, and Alpha edit fields. The negative values can be used in the shaders to subtract colors. You can also revert your changes the same way you could in the scalar editor by pressing the Cancel button.
If we set the values for the Ia parameter to R = 0, G = 112, B = 0, and A = 255, we can see the image in Figure 29: our Phong-shaded teapot!
Figure 30: Adding the render state block from the pass context menu
render state block found in the workspace tree. If there are no other render state blocks found prior to the one created, it does not inherit any values. Changing the render state values in the created render state node overrides inherited values. Note that for upward traversal, the application only looks in the passes within the current effect and the default effect. The render state blocks in other effects don't propagate their values.
To edit any of the render states in a render state block, you can double-click on the render state node or right-click on the node and select Edit from the node context menu. The render state editor window will appear, as shown in Figure 31 on the following page. To edit a particular render state, click in the Value column for that render state and either select from a set of predefined values or type a value directly if none were supplied (see the above example for the blending op).
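One way to picture the inherit-and-override behavior described above is as a layered dictionary merge. The Python sketch below is a simplified model and not how RenderMonkey is implemented: the function name, the state names, and their values are illustrative assumptions.

```python
def resolve_render_states(blocks):
    """Merge render state blocks in traversal order.

    `blocks` is ordered from the earliest block found (for example, one in the
    default effect) to the block in the current pass. Later blocks override
    values inherited from earlier ones.
    """
    effective = {}
    for block in blocks:
        effective.update(block)
    return effective

# Illustrative (assumed) render state blocks.
default_effect_block = {"Fillmode": "SOLID", "Cullmode": "CCW"}
pass_block = {"Fillmode": "WIREFRAME"}

effective = resolve_render_states([default_effect_block, pass_block])
isolated = resolve_render_states([pass_block])
```

In this model, a block with no earlier blocks inherits nothing, and blocks from other effects are simply never placed in the list.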
Let's display our celadon teapot in wireframe. That's very simple to do: find the Fillmode render state in the editor and set its value to WIREFRAME by right-clicking in the Value column and selecting that option from the menu. You will instantly see the teapot displayed in the preview window in wireframe:
Texturing in RenderMonkey
All games these days use various texture maps for their visual effects. Let's learn how to use texture maps in RenderMonkey. As you have learned previously, RenderMonkey has special variable types for 2D textures, cube maps, and volume textures. Let's add a 2D texture map variable to our workspace. Right-click on the effect workspace node and select Add Variable from the menu. Select Texture as type, and type baseMap into the name field. By default, all textures are added as artist-editable variables. You will see a texture variable appear in your workspace.
Next, in order to use texturing for our effect, we need to have texture coordinates stream into the vertex shader. Double-click on the stream map node named standard mapping and add the third channel for texture coordinates: Reg = v2, Usage = TexCoord, UsageIndex = 0, Type = Float2. That creates a new stream channel to feed to the vertex shader. The next step is to add texture coordinate propagation to the vertex shader. That's very simple: open the shader editor for the vertex shader, and type the following code. The lines that handle the texture coordinate (Tex and inTex) are the ones that differ from the previous example's vertex shader:
struct VS_OUTPUT
{
   float4 Pos   : POSITION;
   float3 Norm  : TEXCOORD0;
   float3 View  : TEXCOORD1;
   float3 Light : TEXCOORD2;
   float2 Tex   : TEXCOORD3;
};

VS_OUTPUT main( float4 inPos  : POSITION,
                float3 inNorm : NORMAL,
                float2 inTex  : TEXCOORD0 )
{
   VS_OUTPUT Out = (VS_OUTPUT) 0;

   // Output transformed position:
   Out.Pos = mul( view_proj_matrix, inPos );

   // Output light vector:
   Out.Light = -lightDir;

   // Compute position in view space:
   float3 Pview = mul( view_matrix, inPos );

   // Transform the input normal to view space:
   Out.Norm = normalize( mul( view_matrix, inNorm ) );

   // Compute the view direction in view space:
   Out.View = - normalize( Pview );

   // Propagate texture coordinate to the pixel shader:
   Out.Tex = inTex;

   return Out;
}
This forces the vertex shader to propagate texture coordinates to the pixel shader. But to actually sample textures in the pixel shader, we need to bind our texture variable to a texture object.
Texture Objects
To use texture-based variables, you have to first create a texture variable using the Add Variable dialog in the desired location of the workspace. Once that texture variable is created, you need to select a file from which to load the texture. To actually use a texture within a pass, you need to select the desired pass and select the Add Texture Object menu option, as shown in Figure 33 on the following page:
This creates an empty texture object. A texture object that doesn't have a valid texture reference appears with a red line through it. Texture objects map to texture stages used in your shaders, and they are also used to store texture stage and sampler states associated with that texture stage or a sampler. To actually use a texture object in the shader, we need to add a texture reference to it. To do that, right-click on the Pass 1 node and select the Add Texture Reference menu option from the context menu that appears (shown in Figure 34):
This creates an empty texture reference. To actually bind the reference to a texture variable, the user should type the name of the variable that he wants to reference. If a valid texture variable is found successfully, then the red line through the texture reference is removed. A red line across the texture reference icon denotes that the texture variable wasn't successfully referenced. By default, RenderMonkey binds the texture reference to the baseMap texture variable if one is found in the workspace, so we don't need to do anything to bind our texture reference.
If we want to specify some sampler states for our texture map, we need to specify these state values (filtering, clamping, etc.) for a particular texture reference node using the texture editor, which can be launched by double-clicking on a texture reference node. Figure 35 shows the texture editor for three texture objects:
The texture editor has tabs for each individual pass within an effect. The top of the texture editor contains a list of texture references within the selected pass. By clicking on a texture icon, you can select to view and set texture states for that texture. To set a particular state, you should click on the Value field next to the state you are trying to edit and either select a value from the predefined set of values for that state or type a value if none was provided.
Note that the texture editor displays thumbnails for all texture variables that have a valid file associated with them, and you can see a small icon in the bottom-left corner of each thumbnail showing what type of texture reference it is. Also note that only the texture objects with valid texture references have a type icon or a thumbnail image. If a texture object's texture reference isn't correctly linked, then that object is displayed with a placeholder icon instead. The texture editor creates thumbnails for all texture variables; however, for cube maps or volume textures, only the first image is displayed.
Below is the text of the pixel shader modified to use texturing (the Tex input and the FinalColor computation are updated from the previous example):
float4 main( float4 Diff   : COLOR0,
             float3 Normal : TEXCOORD0,
             float3 View   : TEXCOORD1,
             float3 Light  : TEXCOORD2,
             float2 Tex    : TEXCOORD3 ) : COLOR
{
   // Compute the reflection vector:
   float3 vReflect = normalize( 2 * dot( Normal, Light) * Normal - Light );

   // Compute ambient term:
   float4 AmbientColor = Ia * Ka;

   // Compute diffuse term:
   float4 DiffuseColor = Id * Kd * max( 0, dot( Normal, Light ));

   // Compute specular term:
   float4 SpecularColor = Is * Ks * pow( max( 0, dot(vReflect, View)), Ns );

   float4 FinalColor = (AmbientColor + DiffuseColor) * tex2D( baseMap, Tex ) + SpecularColor;
   return FinalColor;
}
Once you compile this shader, you will see a nicely textured teapot appear in the preview window.
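The final line of the shader modulates only the ambient and diffuse terms by the texture sample and adds the specular term afterward, so the highlight keeps the color of the light rather than the color of the texture. The short Python sketch below mirrors that one expression on the CPU; the function name and the sample values are assumptions for illustration.

```python
def shade_textured(ambient, diffuse, specular, texel):
    """Per-component mirror of (Ambient + Diffuse) * texel + Specular."""
    return [(a + d) * t + s
            for a, d, s, t in zip(ambient, diffuse, specular, texel)]

# Illustrative (assumed) gray lighting terms and a pure red texel.
ambient = [0.25, 0.25, 0.25, 1.0]
diffuse = [0.5, 0.5, 0.5, 0.0]
specular = [0.125, 0.125, 0.125, 0.0]
red_texel = [1.0, 0.0, 0.0, 1.0]

final_color = shade_textured(ambient, diffuse, specular, red_texel)
```

The green and blue channels of the result carry only the specular contribution, because the red texel removes the ambient and diffuse light from them.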
Rendering to a Texture
Let's complicate our effect a little bit. Let's use the output of the first pass (the one that we just created) and funnel it as the input into the second pass. That technique is called rendering to a texture, and it can be used for a variety of interesting post-processing effects. (Take a look at the article "Real-Time Depth of Field Simulation" (G. Riguer, N. Tatarchuk, J. Isidoro) in ShaderX2: Shader Programming Tips & Tricks with DirectX 9 for an example of depth of field effects using that technique.)
Render Passes
To start working on the simplest effect based on rendering to a texture, we need at least two passes. Let's add a new pass to our workspace. To do that, right-click on the effect node and select Add Pass from the menu. By default, each pass is created with a sample HLSL vertex and pixel shader and geometry and stream map reference nodes; you can modify those at any time. Once you add a new pass, you can see a red teapot appear in the preview window again. That's because the passes are drawn in the order in which they are defined within their parent effect. To move a pass up or down, you can right-click on the desired pass and select Move Up or Move Down from the pass context menu shown in Figure 37. You may also use Ctrl+up arrow to move a pass up or Ctrl+down arrow to move the pass down. Try that with the two passes that we have; if you move Pass 2 to be above Pass 1, you will see the textured teapot again. Then if you move it back, the red teapot appears again.
You can also disable a particular pass to aid you in your shader debugging. To do that, select Enable/Disable Pass from the pass context menu (accessible by right-clicking on the desired pass). A disabled pass is marked with an icon to the left of its name. To enable the pass, just click on the same menu option again.
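The effect of pass order and of disabling a pass can be sketched with a plain list. This Python model is an assumption made for illustration (it is not RenderMonkey's internal representation), and it assumes both passes cover the same pixels so that the pass drawn last is the one you see.

```python
def move_up(passes, name):
    """Swap the named pass with the one before it, if there is one."""
    index = next(i for i, p in enumerate(passes) if p["name"] == name)
    if index > 0:
        passes[index - 1], passes[index] = passes[index], passes[index - 1]

def draw_order(passes):
    """Enabled passes are drawn top to bottom; disabled passes are skipped."""
    return [p["name"] for p in passes if p["enabled"]]

passes = [
    {"name": "Pass 1", "enabled": True},   # textured teapot
    {"name": "Pass 2", "enabled": True},   # default red teapot
]

initial_order = draw_order(passes)         # Pass 2 is drawn last: red teapot
move_up(passes, "Pass 2")
reordered = draw_order(passes)             # Pass 1 is drawn last: textured teapot
passes[0]["enabled"] = False               # disable Pass 2
after_disable = draw_order(passes)
```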
Figure 37: Modifying the pass order from the context menu
1. Create a renderable texture at any point in the workspace. Only one pass can render output to that texture at a time. To add a renderable texture, click on any node that you would like to add it to and select Add Renderable Texture from the context menu that appears at that point:
2. You will see a new node appear in the tree. This node is the renderable texture node that you will later link to a render target and to a texture object in order to sample from this renderable texture.
3. Next you need to add a render target to the pass that is going to output to the renderable texture. Select the pass node and right-click on it to bring up the context menu for that pass; choose Add Render Target to add a new render target (the node gets its own icon once it's created).
4. Next you must link the render target node to the renderable texture that you've created. You can either rename the render target node to exactly the same name as the renderable texture node to which you want to link it, or you can right-click on the render target node and select a node to reference from a context menu that will appear:
Figure 41: Linking the render target node to a renderable texture variable
5. At this point, the output of the pass that owns the render target node is drawn to the renderable texture.
6. Next, let's link the renderable texture to a pass that is going to sample from it. To do that, you must first create a texture object and a texture reference within that pass (see the section on managing textures above). Once a texture reference exists, you must link it to the renderable texture by either renaming the texture reference node to exactly the same name as the renderable texture or by right-clicking on the texture reference node and selecting the renderable texture you want to link it to from the Reference Node menu:
Figure 42: Linking a texture object to a renderable texture variable for sampling
7. At this point, you can use the texture object as you would normally use it in your shader (assembly or HLSL).
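The linking steps above come down to matching names: a render target or texture reference is linked when its name is exactly the name of a renderable texture. The Python sketch below models that rule together with the one-writer restriction from step 1. All function names and the dictionary layout are assumptions for illustration, and treating the name match as case-sensitive is also an assumption.

```python
def resolve_link(node_name, renderable_textures):
    """Return the renderable texture with exactly this name, or None if unlinked."""
    return renderable_textures.get(node_name)

def add_render_target(pass_name, target_name, renderable_textures):
    """Link a pass's render target to a renderable texture by name."""
    texture = resolve_link(target_name, renderable_textures)
    if texture is None:
        return False  # unlinked node
    if texture["written_by"] not in (None, pass_name):
        return False  # only one pass renders to a given texture at a time
    texture["written_by"] = pass_name
    return True

def add_texture_reference(pass_name, reference_name, renderable_textures):
    """Link a pass's texture reference to a renderable texture by name."""
    texture = resolve_link(reference_name, renderable_textures)
    if texture is None:
        return False
    texture["sampled_by"].append(pass_name)
    return True

# Hypothetical workspace with a single renderable texture.
renderable_textures = {"renderTexture": {"written_by": None, "sampled_by": []}}

ok_target = add_render_target("Pass 1", "renderTexture", renderable_textures)
ok_sample = add_texture_reference("Pass 2", "renderTexture", renderable_textures)
bad_link = add_texture_reference("Pass 2", "renderTex", renderable_textures)
```

In this model, the output of Pass 1 becomes the input of Pass 2 purely through the shared name.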
Let's add a renderable texture to our workspace. Right-click on the effect workspace node and select Add Renderable Texture from the context menu. Then we need to add a render target to Pass 1: right-click on that pass and select Add Render Target. Link this render target to the renderable texture that we have created by right-clicking on the render target and selecting renderTexture from the Reference Node menu that appears. You will see that the red line across the render target node disappears and the name of the render target is now renderTexture. At this point, the output of Pass 1 is diverted to the renderable texture variable. Next we want to add the ability to sample from that texture in our second pass. First we need to make sure that the vertex shader propagates the texture coordinates correctly. Type this text into the vertex shader:
struct VS_OUTPUT
{
   float4 Pos: POSITION;
   float2 Tex: TEXCOORD0;
};

VS_OUTPUT main( float4 Pos: POSITION, float2 Tex: TEXCOORD0 )
{
   VS_OUTPUT Out = (VS_OUTPUT) 0;
   Out.Pos = mul( view_proj_matrix, Pos );
   Out.Tex = Tex;
   return Out;
}
This ensures that we will be interpolating texture coordinates into the pixel shader. Next, let's add a texture object with a texture reference to Pass 2, following the same steps as in the earlier example. However, instead of linking the texture reference to a texture variable, let's link it to the renderTexture renderable texture variable. This directs the output of Pass 1 to Pass 2. Open the pixel shader for Pass 2 and add renderTexture as a sampler to that pass. Then type this text as the pixel shader code:
float4 main( float4 Diff : COLOR0,
             float2 Tex  : TEXCOORD0 ) : COLOR
{
   return tex2D( renderTexture, Tex );
}
At this point the preview window shows a green textured teapot (take a look at Figure 43). Set these sampler states for the texture objects in both passes for a nicer rendering result: Minfilter = LINEAR and Magfilter = LINEAR. (The picture below has the preview window's clear color set to a dark gray value.)
In the renderable texture editor you can change the dimensions of the renderable texture. To change either the width or height of the texture, type the integer dimension into the appropriate edit box and press Enter to propagate the changes and create a new renderable texture. You may also bind the texture to use the dimensions of the current viewport by checking the Use viewport dimensions box. To change the format of the renderable texture, the user can select from a list of predefined formats by selecting them from the Format combo box control.
From this editor, the user can select whether to clear the renderable texture by checking or unchecking the Enable color clear box. If the user chooses to clear the texture, he can select the color he wishes to clear it to by clicking on the Clear Color button and selecting the color from the dialog that appears. The user can also select whether to enable depth clearing by checking or unchecking the Enable depth clear box. If depth clearing is enabled, the user can select the value used.
Artist Editor
One of the problems that shader developers face in production is how to present the shaders to the 3D artists to allow the artists to experiment with the shader parameters in order to achieve desired effects. RenderMonkey's solution for this problem is the artist editor module combined with the Art tab in the workspace view.
A shader developer can select certain variables in the shader effect workspace to be flagged as artist-editable variables. To do that, you can select Artist Editable from the right-click menu for the desired variable node, and a small yellow flag icon will be overlaid over the icon for that variable. Then you can give the Effect Workspace with your shaders to the artists you work with. The artists can select the Art tab from the workspace view to only view artist variables present in the workspace. For added convenience, artists can edit artist variables of supported types in the artist editor module. Currently, the supported types for the artist editor are vectors, scalars, and colors; however, any variable can be flagged as an artist variable and accessed from the Art workspace tab. To open the artist editor, you can either click the button on the application toolbar or select Artist Editor from the View menu in the main application menu bar.
The Artist Editor window has tabs for each effect workspace, effect group, effect, or pass that contains artist-editable variables. If the node contains no artist-editable variables of supported types, it won't appear as a tab in the artist editor. Artist-editable variables are arranged by their types in groups (color, vector, and scalar). Each group can be expanded or collapsed by clicking on the button within the group.
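The filtering and grouping rules described here can be summarized in a few lines. The Python sketch below is a hypothetical model, not RenderMonkey code: the variable records and the function name are assumptions for illustration. It keeps only artist-editable variables of the supported types and groups them by type.

```python
SUPPORTED_TYPES = ("color", "vector", "scalar")

def artist_groups(variables):
    """Group artist-editable variables of supported types by their type.

    Returns an empty dict when nothing qualifies, which corresponds to a node
    that gets no tab in the artist editor.
    """
    groups = {type_name: [] for type_name in SUPPORTED_TYPES}
    for variable in variables:
        if variable["artist_editable"] and variable["type"] in groups:
            groups[variable["type"]].append(variable["name"])
    return {type_name: names for type_name, names in groups.items() if names}

# Hypothetical contents of an effect workspace node.
workspace_variables = [
    {"name": "Ka", "type": "scalar", "artist_editable": True},
    {"name": "Ia", "type": "color", "artist_editable": True},
    {"name": "lightDir", "type": "vector", "artist_editable": False},
    {"name": "baseMap", "type": "texture", "artist_editable": True},
]

tab_contents = artist_groups(workspace_variables)
```

The texture variable is flagged as artist editable but is left out here, since only vectors, scalars, and colors are supported by the artist editor itself; it would still be reachable from the Art workspace tab.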
Figure 48: Individual set of controls for editing color in the artist editor
If you click on the button, you will get an expanded set of controls for editing color with more precision, as shown in Figure 49.
Figure 49: Expanded set of controls for editing color variables in the Artist Editor window
Vectors
Each vector variable has five related controls: a label button that opens up the full vector editor and four component edit boxes with pop-up slider buttons for editing each vector component interactively:
If the user clicks on the button for a particular vector, he will see an expanded set of controls for editing vectors with more precision and control:
Figure 51: Expanded set of controls for editing vectors in the artist editor
Scalars
Each scalar has two related controls: a label button that opens up the full scalar editor and an edit box with a pop-up slider button for editing the scalar value directly.
If the user clicks on the button, he can see an expanded set of controls for editing scalar variables in the artist editor:
Figure 53: Expanded set of controls for editing scalars in the artist editor
Summary
I hope that this article was helpful in showing you the ease of use and convenience of developing shaders with the RenderMonkey IDE. As with all the tools and samples provided by ATI, we welcome feedback from the developers who spend every day in the trenches solving real problems. ATI is committed to providing you with the tools that you need to make your job easier. In order to do this, we need you to tell us what works and what doesn't. What additions or enhancements would you like to see? What additional problem areas exist that we're not currently helping with? Please help us to help you by providing as much feedback as possible to [email protected].
Certain shaders, such as bump-mapping shaders, require the use of tangent space. (For more information on tangent basis and its use in bump-mapping, please refer to nVidia's The Cg Tutorial.) Since 3D model data typically comes with only vertices, normals, and texture coordinates, a common method is to automagically deduce the corresponding tangents and binormals (the tangent basis consists of the normal, binormal, and tangent) based on normals and texture coordinates. This method is convenient and effective, but sometimes it can produce undesirable artifacts. This is due to the following factors:
• It requires suitable texture coordinates.
• It is influenced by vertex weight, or the number of triangles sharing the same vertex.
• It is ideal for models with convex surfaces but presents problems for models with indentations or protrusions.
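The first two factors are easier to see with the arithmetic written out. The Python sketch below uses a widely known per-triangle derivation (solving the two edge equations for the tangent and binormal) followed by a simple per-vertex average; it is an assumption about the kind of method the text refers to, not the specific code used by any tool, and all names are illustrative.

```python
import math

def sub(a, b):
    """Component-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]

def triangle_tangent_basis(p0, p1, p2, uv0, uv1, uv2):
    """Solve e1 = du1*T + dv1*B and e2 = du2*T + dv2*B for T and B."""
    e1, e2 = sub(p1, p0), sub(p2, p0)
    du1, dv1 = uv1[0] - uv0[0], uv1[1] - uv0[1]
    du2, dv2 = uv2[0] - uv0[0], uv2[1] - uv0[1]
    det = du1 * dv2 - du2 * dv1
    if det == 0.0:
        # Texture coordinates that collapse to a line or point give no basis.
        raise ValueError("unsuitable texture coordinates for this triangle")
    r = 1.0 / det
    tangent = [(e1[i] * dv2 - e2[i] * dv1) * r for i in range(3)]
    binormal = [(e2[i] * du1 - e1[i] * du2) * r for i in range(3)]
    return tangent, binormal

def vertex_tangent(triangle_tangents):
    """Average the tangents of all triangles sharing a vertex, then normalize."""
    summed = [sum(t[i] for t in triangle_tangents) for i in range(3)]
    length = math.sqrt(sum(c * c for c in summed))
    return [c / length for c in summed]

tangent, binormal = triangle_tangent_basis(
    [0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0], [1, 0], [0, 1])

# Three triangles pull the shared vertex toward +x, one toward +y.
skewed = vertex_tangent([[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0]])
```

Because each sharing triangle contributes equally to the average, a vertex surrounded by many small triangles on one side ends up with a tangent biased toward that side, which is the vertex weight effect listed above.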
Figure 5: The tessellation has a great effect on bump-mapping and specular highlights.
• Use a modeling tool that allows for tweaking of normals, tangents, and binormals. Some tools support normal tweaking, but tangent and binormal adjustment is rare.
• Break the model apart. Figure 6 shows such a case. The result is a total discontinuity between the two meshes.
• Add tessellation to buffer or soften the effects of the discontinuity. Figure 7 shows the result. This actually preserves a little continuity, as seen by the highlights around the indentation compared to Figure 6.
Conclusion
The combination of the three methods (generating suitable texture coordinates, re-tessellating to distribute vertex weights more evenly, and buffered tessellation to soften the effect of discontinuity) is effective in creating complex models that render bump maps, specular highlights, and other tangent basis-dependent effects correctly. It does not need changes to modeling tools or shaders. Instead, it only requires a little more work on the part of the modeler to tweak the model to become shader-friendly.
Color Plate 1. (Cook-Torrance lighting) Rendering with various refraction index values with pixel shader 1.4 (top row) and pixel shader 2.0 (bottom row). Roughness is constant at 0.15. The index of refraction is 0.15, 0.45, and 0.85 (left to right). Note the visibility of the face edges and error (crack) in the middle of the large highlights in the 1.4 version. (See page 147.)
Color Plate 7. Shadowed scene with shadow volume exposed (See page 276.)
Index
A animated fog, 174-176 implementing, 176-178 approximations, using for optimization, 269-270 arbitrary source swizzling, 69-70 using with destination write masks, 70 artist editor module, in RenderMonkey, 332-336 using to edit variables, 334-336 assembly language and DirectX, 4-6 assembly-level shader models, 4-6 assembly shaders, editing in RenderMonkey, 303-305 B back capping, 231-232 _bias modifier, 48 bilinear filtering, 186 branching, dynamic, 44 static, 43-44 _bx2 modifier, 47-48 C _centroid modifier, 81 clipping, 205 problems with, 212-219 col_major modifier, 13 color variables, editing in RenderMonkey, 313-314, 334-335 command-line options, 8 compile target, 6 modifiers, 46-50 compile targets, using ps_1_x, 46-47 complement modifier, 49-50 const modifier, 12 constant table, 26, 59-61 example of, 26-27 constructors, working with in HLSL, 15 Cook-Torrance lighting model, 134-136 HLSL pixel shader example, 145-147 HLSL vertex shader example, 143-145 pixel shader 1.4 example, 142-143 pixel shader 2.0 example, 138-140 vertex shader 2.0 example, 136-138 cube map environment mapping, 108-109 cube map space, 109 D _d2 destination write modifier, 49 _d4 destination write modifier, 49 _d8 destination write modifier, 49 D3DX Effects, 51 using with HLSL, 51-58 data, preprocessing, 245-248 data input, 25 uniform, 25-27 varying, 27-29 data output, 29-31 data set, processing, 267-269 data type declarations, 44-45 data types, in HLSL, 9-12 matrix, 11-12 scalar, 9-10 vector, 10-11 degenerate quads, 246 drawbacks to using, 260-262 using, 247-249 depth bias, 183-185 depth clamping, 218-219 depth comparison, 183-185 depth-fail, 205-209, 275
and view frustum clipping, 212-215 drawbacks of, 209-219 example, 238-241 two-sided, 264-266 depth-pass, 201-205, 275 and view frustum clipping, 212-214 drawbacks of, 204-205, 209-219 example, 233-238 destination write masks, using with arbitrary source swizzling, 70 using with texture instructions, 70-71 DirectX and assembly language, 4-6 draw call, 284 dual paraboloid environment mapping, 108-109 dynamic branching, 44 dynamic flow control, 66-69 E edge elimination, 221-222 effect, 51 group, 284 nodes, 284 workspace, 284 effect API, 57-58 effect file example, 52-57 effects, managing in RenderMonkey, 294-295 environment mapping, 108-109 HLSL pixel shader example, 120-121 HLSL vertex shader example, 119-120 pixel shader 1.4 example, 115-117 pixel shader 2.0 example, 117-119 vertex shader 2.0 example, 112-114 environmental fog, 151 errors, checking in RenderMonkey, 302-303 exponential fog, 157-158 implementing, 159-161 exponential squared fog, 162-163 implementing, 164-166 expp instruction, 45 extern modifier, 13
F face register, 78-79 finite shadow volume, 209-210 implementing, 250-256 flow control, 66-69 dynamic, 66-69 static, 66 using to optimize shader, 42-44 fog, 151 animated, see animated fog calculating, 152-153 exponential, see exponential fog exponential squared, see exponential squared fog layered, see layered fog linear, see linear fog fog effects, adding, 151 Fresnel term, using, 111 front capping, 231-232 rendering, 241-243 fxc command-line compiler, 7-8 G geometries, preprocessing, 245-248 ghost shadows, 210-212 gradient instructions in ps_3_0, 80-81 H High Level Shading Language, see HLSL HLSL, 1 constructors in, 15 initializing variables in, 14 invoking compiler, 58-61 keywords, 8-9 modifiers, 46-50 optimizing, 39-51 storage class modifiers, 13-14 structures in, 17 type casting in, 15-17 type modifiers, 12-13 using to implement shadow volumes, 262 using with D3DX Effects, 51-58 vectors in, 14-15 HLSL data types, 9-12 matrix, 11-12 scalar, 9-10
vector, 10-11 HLSL pixel shader, 3-4 Cook-Torrance lighting example, 145-147 environment mapping example, 120-121 example, 35-39 Oren-Nayar lighting example, 133-134 Phong lighting example, 95-97 HLSL shader, 2-4 drawbacks to using, 6-7 using textures with, 322-323 HLSL vertex shader, 2-3 Cook-Torrance lighting example, 143-145 environment mapping example, 119-120 example, 32-35 Oren-Nayar lighting example, 131-132 Phong lighting example, 94-95 I infinite shadow volume, 200 implementing, 256-260 input, declaring, 64-65 input type declarations, 44-45 instruction count limitations, 46-47 instructions, in ps_3_0, 80-81 in vs_3_0, 73 integer data type, using to optimize shader, 41-42 intrinsics, 19 math, 20-22 sampling, 23-25 invisible fillrate, minimizing, 270-271 L Lambertian model, 125 layered fog, 166-168 implementing, 168-173 light source management, 271 light sources, culling, 271-272 lighting model concepts, 122-125
lighting models, Cook-Torrance, 134-136 Oren-Nayar, 125-127 Phong, 84-85 linear fog, 154 implementing, 155-156 lit instruction, 45 log instruction, 45-46 logp instruction, 45-46 loops, using for optimization, 42 M masking, 124-125 math intrinsics, 20-22 matrix data type, using for optimization, 40-41 matrix variables, editing in RenderMonkey, 312-313 model node, 293-294 model reference node, 293-294 modifiers in HLSL, 12-14 N negate modifier, 50 nodes, 285 normalization, 92 NPR Metallic example, 31-39 HLSL pixel shader, 35-39 HLSL vertex shader, 32-35 O occluder, 199-200 culling, 272-273 optimization, data type declaration, 44-45 HLSL, 39-51 precision, 45-46 ps_1_x, 51 shadow volumes, 267-275 using flow control, 42-44 using integer data type, 41-42 using loops, 42 using matrix data type, 40-41 Oren-Nayar lighting model, 125-127 HLSL pixel shader example, 133-134 HLSL vertex shader example, 131-132
pixel shader 2.0 example, 127-131 output, declaring, 64-65 P paraboloid environment mapping, 108-109 pass, 284 PCF, see percentage closer filtering percentage closer filtering, 185-186 per-pixel Phong lighting, 84-85 see also Phong lighting pixel shader 2.0 example, 89-93 vertex shader 2.0 example, 86-89 Phong lighting, 84-85 see also per-pixel Phong lighting HLSL pixel shader example, 95-97 HLSL vertex shader example, 94-95 using, 282-283 vertex shader example, 298-301 pixel shader, editing in RenderMonkey, 306-308 input semantics, 29 output semantics, 30 pixel shader 1.4, Cook-Torrance lighting example, 142-143 environment mapping example, 115-117 using, 140-142 vs. pixel shader 2.0, 97, 147-148 pixel shader 2.0, Cook-Torrance lighting example, 138-140 environment mapping example, 117-119 Oren-Nayar lighting example, 127-131 per-pixel Phong lighting example, 89-93 shadow map generation example, 188 shadow rendering example, 190-194 vs. pixel shader 1.4, 97, 147-148 pixel shader 3.0, 97-98 four-spotlight example, 103-108 position register, 79 precision, optimization issues with, 45-46
predefined variables in RenderMonkey, 288-290 predicate register, 65-66 predication, 65-66 procedural wood example, pixel shader, 3-4 vertex shader, 2-3 ps_1_x compile target modifiers, 46-50 ps_1_x compile targets, using, 46-47 ps_1_x optimization, 51 ps_3_0 features, 64-71, 78-82 R reflection vector, calculating, 109-111 registers, in ps_3_0, 78-79 in vs_3_0, 71-72 render passes, 324-325 render states, managing in RenderMonkey, 314-316 render target, editing, 332-336 renderable texture, editing, 331 RenderMonkey, adding shaders with, 295 artist editor module, 332-336 checking errors in, 302-303 compiling shaders in, 302 editing assembly shaders in, 303-305 editing pixel shaders in, 306-308 editing render targets in, 332-336 editing renderable textures in, 331 editing shaders with, 296-298 editing variables in, 310-314, 334-336 IDE, 281-282 managing effects in, 294-295 managing render states in, 314-316 rendering to texture with, 324-330 texturing in, 317-323 using, 279-280 using to render a specular material, 282-283 using variables with, 286-290 re-tesselation, 341-344 roughness, 123-124 row_major modifier, 13
S samplers, 4, 17-19 examples of, 17-19 _sat modifier, 50 saturate modifier, 50 scalar variables, editing in RenderMonkey, 310-311, 336 scene management, 271-274 semantics, 2-3 shader 3.0 model, 63 shader input, 25 uniform, 25-27 varying, 27-29 shader output, 29-31 shaders, see also vertex shader, pixel shader, HLSL shader adding with RenderMonkey, 295 advantages to using, 260-262 compiling in RenderMonkey, 302 drawbacks to using, 260-262 editing in RenderMonkey, 296-298 NPR Metallic, 31-39 procedural wood, 2-4 shadow map, 182 filtering, 185-186 shadow map generation, pixel shader 2.0 example, 188 vertex shader 2.0 example, 187 shadow mapping algorithm, 182-183 shadow rendering, pixel shader 2.0 example, 190-194 vertex shader 2.0 example, 188-189 shadow volume capping, 207-209, 231-233 rendering, 241-243 shadow volumes, 197-201 advantages of, 198 forming, 225-230, 249-250 implementing, 201 implementing on CPU, 220-243 implementing on GPU, 243-262 implementing with HLSL, 262 infinite, 200 multiple, 207-208 optimizing, 267-275 overlapping, 203-204 rendering, 241-243
steps for implementing, 220, 244-245 techniques, 201 shadowing, 124-125 shadows, importance of, 181-182, 197 shared modifier, 13 silhouette clipping, 269-270 silhouette determination, 221-225 silhouette mapping, 269 spherical coordinates, 122-123 standard mapping node, 286 static branching, 43-44 static flow control, 66 static modifier, 13 stencil buffer, 199 stencil shadow volumes, see shadow volumes storage class modifiers in HLSL, 13-14 stream mapping node, 290 using, 290-293 structures, working with in HLSL, 17 surface roughness, 123-124 swizzling, 69-70 T tangent space, drawbacks to using, 339 technique, 51-52 texture, editing renderable, 331 rendering to, 324-330 using with HLSL shaders, 322-323 texture coordinates, generating, 340-341 texture editor, 320-321 texture instructions, using with destination write masks, 70-71 texture object, creating, 318-319 texture reference, creating, 319-320 texture sampling, 325-330 in ps_3_0, 82 in vs_3_0, 73-76 intrinsics, 23-25 texturing in RenderMonkey, 317-323 two-sided depth-fail, 264-266 two-sided stenciling, 263-264 render states, 264 type casting in HLSL, 15-17 type modifiers in HLSL, 12-13
U uniform data input, 25-27 uniform modifier, 13 V variables, creating in RenderMonkey, 286-288 editing in RenderMonkey, 310-314, 334-336 initializing in HLSL, 14 predefined in RenderMonkey, 288-290 varying data input, 27-29 v-cavities model, 123 vector variables, editing in RenderMonkey, 311-312, 335-336 vectors, working with in HLSL, 14-15 vertex shader, animated fog example, 176-178 exponential fog example, 159-161 exponential squared fog example, 164-166 finite shadow volume example, 250-256 infinite shadow volume example, 256-260 input semantics, 28 layered fog example, 168-173 linear fog example, 155-156 output semantics, 30 Phong illumination example, 298-301
vertex shader 2.0, Cook-Torrance lighting example, 136-138 environment mapping example, 112-114 per-pixel Phong lighting example, 86-89 shadow map generation example, 187 shadow rendering example, 188-189 vertex shader 3.0, 97-98 four-spotlight example, 98-102 vertex stream frequency, in vs_3_0, 76-78 vertex weight, 341-344 view frustum clipping, 212-219 and depth-fail, 212-215 and depth-pass, 212-214 vs_3_0 features, 64-71, 71-78 W welded meshes, using, 267-268 workspace view, 285-286 X _x2 destination write modifier, 49 _x2 modifier, 48-49 _x4 destination write modifier, 49 _x8 destination write modifier, 49 Z z-fail, see depth-fail z-pass, see depth-pass
About the CD
The companion CD contains examples and source code discussed in the articles. The files are organized into folders named for each article, although there may not be an example for every article. Each folder and/or subfolder includes a readme.txt document that explains the examples, contains instructions, and lists hardware requirements. Simply place the CD in your CD drive and select the folder for the example you would like to see.
Warning:
By opening the CD package, you accept the terms and conditions of the CD/Source Code Usage License Agreement on the following page. Additionally, opening the CD package makes this book nonreturnable.