NOTE! This copyright does *not* cover user programs that use kernel
services by normal system calls - this is merely considered normal use
of the kernel, and does *not* fall under the heading of "derived work".
Also note that the GPL below is copyrighted by the Free Software
Foundation, but the instance of code that it refers to (the Linux
kernel) is copyrighted by me and others who actually wrote it.

Also note that the only valid version of the GPL as far as the kernel
is concerned is _this_ particular version of the license (ie v2, not
v2.2 or v3.x or whatever), unless explicitly otherwise stated.

Linus Torvalds

----------------------------------------

GNU GENERAL PUBLIC LICENSE
Version 2, June 1991

Copyright (C) 1989, 1991 Free Software Foundation, Inc.
51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.

Preamble

The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Library General Public License instead.) You can apply it to
your programs, too.

When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.

We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.

Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and
modification follow.

GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License.
The "Program", below, refers to any such program or work, and a
"work based on the Program" means either the Program or any derivative
work under copyright law: that is to say, a work containing the
Program or a portion of it, either verbatim or with modifications
and/or translated into another language. (Hereinafter, translation is
included without limitation in the term "modification".) Each licensee
is addressed as "you".

Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.

1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.

You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

    a) You must cause the modified files to carry prominent notices
    stating that you changed the files and the date of any change.

    b) You must cause any work that you distribute or publish, that in
    whole or in part contains or is derived from the Program or any
    part thereof, to be licensed as a whole at no charge to all third
    parties under the terms of this License.

    c) If the modified program normally reads commands interactively
    when run, you must cause it, when started running for such
    interactive use in the most ordinary way, to print or display an
    announcement including an appropriate copyright notice and a
    notice that there is no warranty (or else, saying that you provide
    a warranty) and that users may redistribute the program under
    these conditions, and telling the user how to view a copy of this
    License. (Exception: if the Program itself is interactive but
    does not normally print such an announcement, your work based on
    the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.

In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.

3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:

    a) Accompany it with the complete corresponding machine-readable
    source code, which must be distributed under the terms of Sections
    1 and 2 above on a medium customarily used for software interchange; or,

    b) Accompany it with a written offer, valid for at least three
    years, to give any third party, for a charge no more than your
    cost of physically performing source distribution, a complete
    machine-readable copy of the corresponding source code, to be
    distributed under the terms of Sections 1 and 2 above on a medium
    customarily used for software interchange; or,

    c) Accompany it with the information you received as to the offer
    to distribute corresponding source code. (This alternative is
    allowed only for noncommercial distribution and only if you
    received the program in object code or executable form with such
    an offer, in accord with Subsection b above.)

The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.

If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.

4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.

5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.

6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.

7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.
If you cannot distribute so as to satisfy simultaneously your
obligations under this License and any other pertinent obligations,
then as a consequence you may not distribute the Program at all. For
example, if a patent license would not permit royalty-free
redistribution of the Program by all those who receive copies directly
or indirectly through you, then the only way you could satisfy both it
and this License would be to refrain entirely from distribution of the
Program.

If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.

It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.

This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.

8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.

9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free
Software Foundation.

10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.

12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    Copyright (C)

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:

    Gnomovision version 69, Copyright (C) year name of author
    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:

    Yoyodyne, Inc., hereby disclaims all copyright interest in the program
    `Gnomovision' (which makes passes at compilers) written by James Hacker.

    , 1 April 1989
    Ty Coon, President of Vice

This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Library General
Public License instead of this License.

This is at least a partial credits-file of people that have
contributed to the Linux project. It is sorted by name and
formatted to allow easy grepping and beautification by
scripts. The fields are: name (N), email (E), web-address
(W), PGP key ID and fingerprint (P), description (D), and
snail-mail address (S).
Thanks,

Linus
----------

N: Matti Aarnio
E: mea@nic.funet.fi
D: Alpha systems hacking, IPv6 and other network related stuff
D: One of assisting postmasters for vger.kernel.org's lists
S: (ask for current address)
S: Finland

N: Dragos Acostachioaie
E: dragos@iname.com
W: http://www.arbornet.org/~dragos
D: /proc/sysvipc
S: C. Negri 6, bl. D3
S: Iasi 6600
S: Romania

N: Mark Adler
E: madler@alumni.caltech.edu
W: http://alumnus.caltech.edu/~madler/
D: zlib decompression

N: Monalisa Agrawal
E: magrawal@nortelnetworks.com
D: Basic Interphase 5575 driver with UBR and ABR support.
S: 75 Donald St, Apt 42
S: Weymouth, MA 02188
S: USA

N: Dave Airlie
E: airlied@linux.ie
W: http://www.csn.ul.ie/~airlied
D: NFS over TCP patches
D: in-kernel DRM Maintainer
S: Longford, Ireland
S: Sydney, Australia

N: Tigran A. Aivazian
E: tigran@aivazian.fsnet.co.uk
W: http://www.moses.uklinux.net/patches
D: BFS filesystem
D: Intel IA32 CPU microcode update support
D: Various kernel patches
S: United Kingdom

N: Werner Almesberger
E: werner@almesberger.net
W: http://www.almesberger.net/
D: dosfs, LILO, some fd features, ATM, various other hacks here and there
S: Buenos Aires
S: Argentina

N: Tim Alpaerts
E: tim_alpaerts@toyota-motor-europe.com
D: 802.2 class II logical link control layer,
D: the humble start of an opening towards the IBM SNA protocols
S: Klaproosstraat 72 c 10
S: B-2610 Wilrijk-Antwerpen
S: Belgium

N: Anton Altaparmakov
E: aia21@cantab.net
W: http://www-stu.christs.cam.ac.uk/~aia21/
D: Author of new NTFS driver, various other kernel hacks.
S: Christ's College
S: Cambridge CB2 3BU
S: United Kingdom

N: C. Scott Ananian
E: cananian@alumni.princeton.edu
W: http://www.pdos.lcs.mit.edu/~cananian
P: 1024/85AD9EED AD C0 49 08 91 67 DF D7 FA 04 1A EE 09 E8 44 B0
D: Unix98 pty support.
D: APM update to 1.2 spec.
D: /devfs hacking.
S: 7 Kiwi Loop
S: Howell, NJ 07731
S: USA

N: Erik Andersen
E: andersen@codepoet.org
W: http://www.codepoet.org/
P: 1024D/30D39057 1BC4 2742 E885 E4DE 9301 0C82 5F9B 643E 30D3 9057
D: Maintainer of ide-cd and Uniform CD-ROM driver,
D: ATAPI CD-Changer support, Major 2.1.x CD-ROM update.
S: 352 North 525 East
S: Springville, Utah 84663
S: USA

N: Michael Ang
E: mang@subcarrier.org
W: http://www.subcarrier.org/mang
D: Linux/PA-RISC hacker
S: 85 Frank St.
S: Ottawa, Ontario
S: Canada K2P 0X3

N: H. Peter Anvin
E: hpa@zytor.com
W: http://www.zytor.com/~hpa/
P: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD 1E DF FE 69 EE 35 BD 74
D: Author of the SYSLINUX boot loader, maintainer of the linux.* news
D: hierarchy and the Linux Device List; various kernel hacks
S: 4390 Albany Drive #46
S: San Jose, California 95129
S: USA

N: Andrea Arcangeli
E: andrea@suse.de
W: http://www.kernel.org/pub/linux/kernel/people/andrea/
P: 1024D/68B9CB43 13D9 8355 295F 4823 7C49 C012 DFA1 686E 68B9 CB43
P: 1024R/CB4660B9 CC A0 71 81 F4 A0 63 AC C0 4B 81 1D 8C 15 C8 E5
D: Parport hacker
D: Implemented a workaround for some interrupt buggy printers
D: Author of pscan that helps to fix lp/parport bugs
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
D: Various other kernel hacks
S: Imola 40026
S: Italy

N: Derek Atkins
E: warlord@MIT.EDU
D: Linux-AFS Port, random kernel hacker,
D: VFS fixes (new notify_change in particular)
D: Moving all VFS access checks into the file systems
S: MIT Room E15-341
S: 20 Ames Street
S: Cambridge, Massachusetts 02139
S: USA

N: Michel Aubry
E: giovanni
D: Aladdin 1533/1543(C) chipset IDE
D: VIA MVP-3/TX Pro III chipset IDE

N: Jens Axboe
E: axboe@suse.de
D: Linux CD-ROM maintainer, DVD support
D: elevator + block layer rewrites
D: highmem I/O support
D: misc hacking on IDE, SCSI, block drivers, etc
S: Peter Bangs Vej 258, 2TH
S: 2500 Valby
S: Denmark

N: John Aycock
E: aycock@cpsc.ucalgary.ca
D: Adaptec 274x driver
S: Department of Computer Science
S: University of Calgary
S: Calgary, Alberta
S: Canada

N: Miles Bader
E: miles@gnu.org
D: v850 port (uClinux)
S: NEC Corporation
S: 1753 Shimonumabe, Nakahara-ku
S: Kawasaki 211-8666
S: Japan

N: Ralf Baechle
E: ralf@gnu.org
P: 1024/AF7B30C1 CF 97 C2 CC 6D AE A7 FE C8 BA 9C FC 88 DE 32 C3
D: Linux/MIPS port
D: Linux/68k hacker
S: Hauptstrasse 19
S: 79837 St. Blasien
S: Germany

N: Krishna Balasubramanian
E: balasub@cis.ohio-state.edu
D: Wrote SYS V IPC (part of standard kernel since 0.99.10)

N: Dario Ballabio
E: ballabio_dario@emc.com
E: dario.ballabio@tiscalinet.it
E: dario.ballabio@inwind.it
D: Author and maintainer of the Ultrastor 14F/34F SCSI driver
D: Author and maintainer of the EATA ISA/EISA/PCI SCSI driver
S: EMC Corporation
S: Milano
S: Italy

N: Paul Bame
E: bame@debian.org
E: bame@puffin.external.hp.com
E: paul_bame@hp.com
W: http://www.parisc-linux.org
D: PA-RISC 32 and 64-bit early boot, firmware interface, interrupts, misc
S: MS42
S: Hewlett-Packard
S: 3404 E Harmony Rd
S: Fort Collins, CO 80525
S: USA

N: Arindam Banerji
E: axb@cse.nd.edu
D: Contributed ESDI driver routines needed to port LINUX to the PS/2 MCA.
S: Department of Computer Science & Eng.
S: University of Notre Dame
S: Notre Dame, Indiana
S: USA

N: Greg Banks
E: gnb@alphalink.com.au
D: IDT77105 ATM network driver
D: some SuperH port work
D: some trivial futzing with kconfig

N: James Banks
E: james@sovereign.org
D: TLAN network driver
D: Logitech Busmouse driver

N: Krzysztof G. Baranowski
E: kgb@manjak.knm.org.pl
P: 1024/FA6F16D1 96 D1 1A CF 5F CA 69 EC F9 4F 36 1F 6D 60 7B DA
D: Maintainer of the System V file system.
D: System V fs update for 2.1.x dcache.
D: Forward ported a couple of SCSI drivers.
D: Various bugfixes.
S: ul. Koscielna 12a
S: 62-300 Wrzesnia
S: Poland

N: Fred Barnes
E: frmb2@ukc.ac.uk
D: Various parport/ppdev hacks and fixes
S: Computing Lab, The University
S: Canterbury, KENT
S: CT2 7NF
S: England

N: Paul Barton-Davis
E: pbd@op.net
D: Driver for WaveFront soundcards (Turtle Beach Maui, Tropez, Tropez+)
D: Various bugfixes and changes to sound drivers
S: USA

N: Carlos Henrique Bauer
E: chbauer@acm.org
E: bauer@atlas.unisinos.br
D: Some new sysctl entries for the parport driver.
D: New sysctl function for handling unsigned longs
S: Universidade do Vale do Rio dos Sinos - UNISINOS
S: DSI/IDASI
S: Av. Unisinos, 950
S: 93022000 Sao Leopoldo RS
S: Brazil

N: Peter Bauer
E: 100136.3530@compuserve.com
D: Driver for depca-ethernet-board
S: 69259 Wilhemsfeld
S: Rainweg 15
S: Germany

N: Fred Baumgarten
E: dc6iq@insl1.etec.uni-karlsruhe.de
E: dc6iq@adacom.org
E: dc6iq@db0ais.#hes.deu.eu (packet radio)
D: NET-2 & netstat(8)
S: Soevener Strasse 11
S: 53773 Hennef
S: Germany

N: Donald Becker
E: becker@cesdis.gsfc.nasa.gov
D: General low-level networking hacker
D: Most of the ethercard drivers
D: Original author of the NFS server
S: USRA Center of Excellence in Space Data and Information Sciences
S: Code 930.5, Goddard Space Flight Center
S: Greenbelt, Maryland 20771
S: USA

N: Adam Belay
E: ambx1@neo.rr.com
D: Linux Plug and Play Support
S: USA

N: Daniele Bellucci
E: bellucda@tiscali.it
D: Various Janitor work.
W: http://web.tiscali.it/bellucda
S: Via Delle Palme, 9
S: Terni 05100
S: Italy

N: Krzysztof Benedyczak
E: golbi@mat.uni.torun.pl
W: http://www.mat.uni.torun.pl/~golbi
D: POSIX message queues fs (with M. Wronski)
S: ul. Podmiejska 52
S: Radunica
S: 83-000 Pruszcz Gdanski
S: Poland

N: Randolph Bentson
E: bentson@grieg.seaslug.org
W: http://www.aa.net/~bentson/
P: 1024/39ED5729 5C A8 7A F4 B2 7A D1 3E B5 3B 81 CF 47 30 11 71
D: Author of driver for Cyclom-Y and Cyclades-Z async mux
S: 2322 37th Ave SW
S: Seattle, Washington 98126-2010
S: USA

N: Muli Ben-Yehuda
E: mulix@mulix.org
E: muli@il.ibm.com
W: http://www.mulix.org
D: trident OSS sound driver, x86-64 dma-ops and Calgary IOMMU,
D: KVM and Xen bits and other misc. hackery.
S: Haifa, Israel

N: Johannes Berg
E: johannes@sipsolutions.net
W: http://johannes.sipsolutions.net/
P: 4096R/7BF9099A C0EB C440 F6DA 091C 884D 8532 E0F3 73F3 7BF9 099A
D: powerpc & 802.11 hacker

N: Stephen R. van den Berg (AKA BuGless)
E: berg@pool.informatik.rwth-aachen.de
D: General kernel, gcc, and libc hacker
D: Specialisation: tweaking, ensuring portability, tweaking, cleaning,
D: tweaking and occasionally debugging :-)
S: Bouwensstraat 22
S: 6369 BG Simpelveld
S: The Netherlands

N: Peter Berger
E: pberger@brimson.com
W: http://www.brimson.com
D: Author/maintainer of Digi AccelePort USB driver
S: 1549 Hiironen Rd.
S: Brimson, MN 55602
S: USA

N: Hennus Bergman
P: 1024/77D50909 76 99 FD 31 91 E1 96 1C 90 BB 22 80 62 F6 BD 63
D: Author and maintainer of the QIC-02 tape driver
S: The Netherlands

N: Tomas Berndtsson
E: tomas@nocrew.org
W: http://tomas.nocrew.org/
D: dsp56k device driver

N: Ross Biro
E: ross.biro@gmail.com
D: Original author of the Linux networking code

N: Anton Blanchard
E: anton@samba.org
W: http://samba.org/~anton/
P: 1024/8462A731 4C 55 86 34 44 59 A7 99 2B 97 88 4A 88 9A 0D 97
D: sun4 port, Sparc hacker

N: Hugh Blemings
E: hugh@blemings.org
W: http://blemings.org/hugh
D: Original author of the Keyspan USB to serial drivers, random PowerPC hacker
S: PO Box 234
S: Belconnen ACT 2616
S: Australia

N: Philip Blundell
E: philb@gnu.org
D: Linux/ARM hacker
D: Device driver hacker (eexpress, 3c505, c-qcam, ...)
D: m68k port to HP9000/300
D: AUN network protocols
D: Co-architect of the parallel port sharing system
D: IPv6 netfilter
S: FutureTV Labs Ltd
S: Brunswick House, 61-69 Newmarket Rd, Cambridge CB5 8EG
S: United Kingdom

N: Thomas Bogendorfer
E: tsbogend@alpha.franken.de
D: PCnet32 driver, SONIC driver, JAZZ_ESP driver
D: newport abscon driver, g364 framebuffer driver
D: strace for Linux/Alpha
D: Linux/MIPS hacker
S: Schafhofstr. 40
S: 90556 Cadolzburg
S: Germany

N: Bill Bogstad
E: bogstad@pobox.com
D: wrote /proc/self hack, minor samba & dosemu patches

N: Axel Boldt
E: axel@uni-paderborn.de
W: http://math-www.uni-paderborn.de/~axel/
D: Configuration help text support
D: Linux CD and Support Giveaway List

N: Erik Inge Bolsø
E: knan@mo.himolde.no
D: Misc kernel hacks
D: Updated PC speaker driver for 2.3
S: Norway

N: Andreas E. Bombe
E: andreas.bombe@munich.netsurf.de
W: http://home.pages.de/~andreas.bombe/
P: 1024/04880A44 72E5 7031 4414 2EB6 F6B4 4CBD 1181 7032 0488 0A44
D: IEEE 1394 subsystem rewrite and maintainer
D: Texas Instruments PCILynx IEEE 1394 driver

N: Al Borchers
E: alborchers@steinerpoint.com
D: Author/maintainer of Digi AccelePort USB driver
D: work on usbserial and keyspan_pda drivers
S: 4912 Zenith Ave. S.
S: Minneapolis, MN 55410
S: USA

N: Marc Boucher
E: marc@mbsi.ca
P: CA 67 A5 1A 38 CE B6 F2 D5 83 51 03 D2 9C 30 9E CE D2 DD 65
D: Netfilter core
D: IP policy routing by mark
D: Various fixes (mostly networking)
S: Montreal, Quebec
S: Canada

N: Zoltan Boszormenyi
E: zboszor@mail.externet.hu
D: MTRR emulation with Cyrix style ARR registers, Athlon MTRR support

N: John Boyd
E: boyd@cis.ohio-state.edu
D: Co-author of wd7000 SCSI driver
S: 101 Curl Drive #591
S: Columbus, Ohio 43210
S: USA

N: Peter Braam
E: braam@clusterfs.com
W: http://www.clusterfs.com/
D: Coda & InterMezzo filesystems
S: 181 McNeil
S: Canmore, AB
S: Canada, T1W 2R9

N: Ryan Bradetich
E: rbradetich@uswest.net
D: Linux/PA-RISC hacker
S: 1200 Goldenrod Dr.
S: Nampa, Idaho 83686
S: USA

N: Dirk J. Brandewie
E: dirk.j.brandewie@intel.com
E: linux-wimax@intel.com
D: Intel Wireless WiMAX Connection 2400 SDIO driver

N: Derrick J. Brashear
E: shadow@dementia.org
W: http://www.dementia.org/~shadow
P: 512/71EC9367 C5 29 0F BC 83 51 B9 F0 BC 05 89 A0 4F 1F 30 05
D: Author of Sparc CS4231 audio driver, random Sparc work
S: 403 Gilmore Avenue
S: Trafford, Pennsylvania 15085
S: USA

N: Dag Brattli
E: dagb@cs.uit.no
W: http://www.cs.uit.no/~dagb
D: IrDA Subsystem
S: 19. Wellington Road
S: Lancaster, LA1 4DN
S: UK, England

N: Lars Brinkhoff
E: lars@nocrew.org
W: http://lars.nocrew.org/
D: dsp56k device driver
D: ptrace proxy in user mode kernel port
S: Kopmansg 2
S: 411 13 Goteborg
S: Sweden

N: Paul Bristow
E: paul@paulbristow.net
W: http://paulbristow.net/linux/idefloppy.html
D: Maintainer of IDE/ATAPI floppy driver

N: Dominik Brodowski
E: linux@brodo.de
W: http://www.brodo.de/
P: 1024D/725B37C6 190F 3E77 9C89 3B6D BECD 46EE 67C3 0308 725B 37C6
D: parts of CPUFreq code, ACPI bugfixes, PCMCIA rewrite, cpufrequtils
S: Tuebingen, Germany

N: Andries Brouwer
E: aeb@cwi.nl
D: random Linux hacker
S: Bessemerstraat 21
S: Amsterdam
S: The Netherlands

N: NeilBrown
E: neil@brown.name
P: 4096R/566281B9 1BC6 29EB D390 D870 7B5F 497A 39EC 9EDD 5662 81B9
D: NFSD Maintainer 2000-2007

N: Zach Brown
E: zab@zabbo.net
D: maestro pci sound

N: David Brownell
D: Kernel engineer, mentor, and friend. Maintained USB EHCI and
D: gadget layers, SPI subsystem, GPIO subsystem, and more than a few
D: device drivers. His encouragement also helped many engineers get
D: started working on the Linux kernel. David passed away in early
D: 2011, and will be greatly missed.
W: https://lkml.org/lkml/2011/4/5/36

N: Gary Brubaker
E: xavyer@ix.netcom.com
D: USB Serial Empeg Empeg-car Mark I/II Driver

N: Matthias Bruestle
E: m@mbsks.franken.de
D: REINER SCT cyberJack pinpad/e-com USB chipcard reader driver
S: Germany

N: Adrian Bunk
P: 1024D/4F12B400 B29C E71E FE19 6755 5C8A 84D4 99FC EA98 4F12 B400
D: misc kernel hacking and testing

N: Ray Burr
E: ryb@nightmare.com
D: Original author of Amiga FFS filesystem
S: Orlando, Florida
S: USA

N: Lennert Buytenhek
E: kernel@wantstofly.org
D: Original (2.4) rewrite of the ethernet bridging code
D: Various ARM bits and pieces
S: Ravenhorst 58
S: 2317 AK Leiden
S: The Netherlands

N: Michael Callahan
E: callahan@maths.ox.ac.uk
D: PPP for Linux
S: The Mathematical Institute
S: 25-29 St Giles
S: Oxford
S: United Kingdom

N: Luiz Fernando N. Capitulino
E: lcapitulino@mandriva.com.br
E: lcapitulino@gmail.com
W: http://www.cpu.eti.br
D: misc kernel hacking
S: Mandriva
S: Brazil

N: Remy Card
E: Remy.Card@masi.ibp.fr
E: Remy.Card@linux.org
D: Extended file system [defunct] designer and developer
D: Second extended file system designer and developer
S: Institut Blaise Pascal
S: 4 Place Jussieu
S: 75252 Paris Cedex 05
S: France

N: Ulf Carlsson
D: SGI Indy audio (HAL2) drivers
E: ulfc@bun.falkenberg.se

N: Ed Carp
E: ecarp@netcom.com
D: uucp, elm, pine, pico port
D: cron, at(1) developer
S: 48287 Sawleaf
S: Fremont, California 94539
S: USA

N: Florent Chabaud
E: florent.chabaud@polytechnique.org
D: software suspend
S: SGDN/DCSSI/SDS/LTI
S: 58, Bd Latour-Maubourg
S: 75700 Paris 07 SP
S: France

N: Gordon Chaffee
E: chaffee@cs.berkeley.edu
W: http://bmrc.berkeley.edu/people/chaffee/
D: vfat, fat32, joliet, native language support
S: 3700 Warwick Road
S: Fremont, California 94555
S: USA

N: Chih-Jen Chang
E: chihjenc@scf.usc.edu
E: chihjen@iis.sinica.edu.tw
D: IGMP(Internet Group Management Protocol) version 2
S: 3F, 65 Tajen street
S: Tamsui town, Taipei county,
S: Taiwan 251
S: Republic of China

N: Reinette Chatre
E: reinette.chatre@intel.com
D: WiMedia Link Protocol implementation
D: UWB stack bits and pieces

N: Michael Elizabeth Chastain
E: mec@shout.net
D: Configure, Menuconfig, xconfig

N: Raymond Chen
E: raymondc@microsoft.com
D: Author of Configure script
S: 14509 NE 39th Street #1096
S: Bellevue, Washington 98007
S: USA

N: Christopher L. Cheney
E: ccheney@debian.org
E: ccheney@cheney.cx
W: http://www.cheney.cx
P: 1024D/8E384AF2 2D31 1927 87D7 1F24 9FF9 1BC5 D106 5AB3 8E38 4AF2
D: Vista Imaging usb webcam driver
S: 314 Prince of Wales
S: Conroe, TX 77304
S: USA

N: Stuart Cheshire
E: cheshire@cs.stanford.edu
D: Author of Starmode Radio IP (STRIP) driver
D: Originator of design for new combined interrupt handlers
S: William Gates Department
S: Stanford University
S: Stanford, California 94305
S: USA

N: Randolph Chung
E: tausq@debian.org
D: Linux/PA-RISC hacker
S: Hong Kong

N: Juan Jose Ciarlante
W: http://juanjox.kernelnotes.org/
E: jjciarla@raiz.uncu.edu.ar
E: jjo@mendoza.gov.ar
D: Network driver alias support
D: IP masq hashing and app modules
D: IP masq 2.1 features and bugs
S: Las Cuevas 2385 - Bo Guemes
S: Las Heras, Mendoza CP 5539
S: Argentina

N: Steven P. Cole
E: scole@lanl.gov
E: elenstev@mesatop.com
D: Various build fixes and kernel documentation.
S: Los Alamos, New Mexico
S: USA

N: Hamish Coleman
E: hamish@zot.apana.org.au
D: SEEQ8005 network driver
S: 98 Paxton Street
S: East Malvern, Victoria, 3145
S: Australia

N: Neil Conway
E: nconway.list@ukaea.org.uk
D: Assorted sched/mm titbits
S: Oxfordshire, UK.

N: Kees Cook
E: kees@outflux.net
E: kees@ubuntu.com
E: keescook@chromium.org
W: http://outflux.net/blog/
P: 4096R/DC6DC026 A5C3 F68F 229D D60F 723E 6E13 8972 F4DF DC6D C026
D: Various security things, bug fixes, and documentation.
S: (ask for current address)
S: Portland, Oregon
S: USA

N: Robin Cornelius
E: robincornelius@users.sourceforge.net
D: Ralink rt2x00 WLAN driver
S: Cornwall, U.K.

N: Mark Corner E: mcorner@umich.edu W: http://www.eecs.umich.edu/~mcorner/ D: USB Bluetooth Driver S: University of Michigan S: Ann Arbor, MI N: Michael Cornwell E: cornwell@acm.org D: Original designer and co-author of ATA Taskfile D: Kernel module SMART utilities S: Santa Cruz, California S: USA N: Luis Correia E: lfcorreia@users.sf.net D: Ralink rt2x00 WLAN driver S: Belas, Portugal N: Alan Cox W: http://www.linux.org.uk/diary/ D: Linux Networking (0.99.10->2.0.29) D: Original Appletalk, AX.25, and IPX code D: 3c501 hacker D: Watchdog timer drivers D: Linux/SMP x86 (up to 2.0 only) D: Initial Mac68K port D: Video4Linux design, bw-qcam and PMS driver ports. D: IDE modularisation work D: Z85230 driver D: Former security contact point (please use vendor-sec@lst.de) D: ex 2.2 maintainer D: 2.1.x modular sound S: c/o Red Hat UK Ltd S: Alexandra House S: Alexandra Terrace S: Guildford, GU1 3DA S: United Kingdom N: Cristian Mihail Craciunescu W: http://www.dnt.ro/~cristi/ E: cristi@dnt.ro D: Support for Xircom PGSDB9 (firmware and host driver) S: Bucharest S: Romania N: Laurence Culhane E: loz@holmes.demon.co.uk D: Wrote the initial alpha SLIP code S: 81 Hood Street S: Northampton S: NN1 3QT S: United Kingdom N: Uwe Dannowski E: Uwe.Dannowski@ira.uka.de W: http://i30www.ira.uka.de/~dannowsk/ D: FORE PCA-200E driver S: University of Karlsruhe S: Germany N: Ray Dassen E: jdassen@wi.LeidenUniv.nl W: http://www.wi.leidenuniv.nl/~jdassen/ P: 1024/672D05C1 DD 60 32 60 F7 90 64 80 E7 6F D4 E4 F8 C9 4A 58 D: Debian GNU/Linux: www.debian.org maintainer, FAQ co-maintainer, D: packages testing, nit-picking & fixing. Enjoying BugFree (TM) kernels. 
S: Zuidsingel 10A S: 2312 SB Leiden S: The Netherlands N: David Davies E: davies@wanton.lkg.dec.com D: Network driver author - depca, ewrk3 and de4x5 D: Wrote shared interrupt support S: Digital Equipment Corporation S: 550 King Street S: Littleton, Massachusetts 01460 S: USA N: Frank Davis E: fdavis@si.rr.com E: fdavis112@juno.com D: Various kernel patches S: 8 Lakeview Terr. S: Kerhonkson, NY 12446 S: USA N: Wayne Davison E: davison@borland.com D: Second extended file system co-designer N: Terry Dawson E: terry@perf.no.itg.telecom.com.au E: terry@albert.vk2ktj.ampr.org (Amateur Radio use only) D: trivial hack to add variable address length routing to Rose. D: AX25-HOWTO, HAM-HOWTO, IPX-HOWTO, NET-2-HOWTO D: ax25-utils maintainer. N: Helge Deller E: deller@gmx.de E: hdeller@redhat.de D: PA-RISC Linux hacker, LASI-, ASP-, WAX-, LCD/LED-driver S: Schimmelsrain 1 S: D-69231 Rauenberg S: Germany N: Jean Delvare E: khali@linux-fr.org W: http://khali.linux-fr.org/ D: Several hardware monitoring drivers S: France N: Peter Denison E: peterd@pnd-pc.demon.co.uk W: http://www.pnd-pc.demon.co.uk/promise/ D: Promise DC4030VL caching HD controller drivers N: Todd J. Derr E: tjd@fore.com W: http://www.wordsmith.org/~tjd D: Random console hacks and other miscellaneous stuff S: 3000 FORE Drive S: Warrendale, Pennsylvania 15086 S: USA N: Martin Devera E: devik@cdi.cz W: http://luxik.cdi.cz/~devik/qos/ D: HTB qdisc and random networking hacks N: Alex deVries E: alex@onefishtwo.ca D: Various SGI parts, bits of HAL2 and Newport, PA-RISC Linux. S: 41.5 William Street S: Ottawa, Ontario S: K1N 6Z9 S: CANADA N: Jeff Dike E: jdike@karaya.com W: http://user-mode-linux.sourceforge.net D: User mode kernel port S: 375 Tubbs Hill Rd S: Deering NH 03244 S: USA N: Matt Domsch E: Matt_Domsch@dell.com W: http://www.dell.com/linux W: http://domsch.com/linux D: Linux/IA-64 D: Dell PowerEdge server, SCSI layer, misc drivers, and other patches S: Dell Inc. 
S: One Dell Way S: Round Rock, TX 78682 S: USA N: Mattia Dongili E: malattia@gmail.com D: cpufrequtils (precursor to cpupowerutils) N: Ben Dooks E: ben-linux@fluff.org E: ben@simtec.co.uk W: http://www.fluff.org/ben/ W: http://www.simtec.co.uk/ D: Samsung S3C2410/S3C2440 support, general ARM support D: Maintaining Simtec Electronics development boards S: Simtec Electronics S: Avondale Drive S: Tarleton S: Preston S: Lancs S: PR4 6AX S: United Kingdom N: Ivo van Doorn E: IvDoorn@gmail.com W: http://www.mendiosus.nl D: Ralink rt2x00 WLAN driver S: Haarlem, The Netherlands N: John G Dorsey E: john+@cs.cmu.edu D: ARM Linux ports to Assabet/Neponset, Spot S: Department of Electrical and Computer Engineering S: Carnegie Mellon University S: Pittsburgh, PA 15213 S: USA N: Eddie C. Dost E: ecd@skynet.be D: Linux/Sparc kernel hacker D: Linux/Sparc maintainer S: Rue de la Chapelle 51 S: 4850 Moresnet S: Belgium N: Cort Dougan E: cort@fsmlabs.com W: http://www.fsmlabs.com/linuxppcbk.html D: PowerPC N: Daniel Drake E: dsd@gentoo.org D: USBAT02 CompactFlash support in usb-storage S: UK N: Oleg Drokin E: green@ccssu.crimea.ua W: http://www.ccssu.crimea.ua/~green D: Cleaning up sound drivers, SA1100 Watchdog. S: Skvoznoy per., 14a S: Evpatoria S: Crimea S: UKRAINE, 334320 N: Walt Drummond E: drummond@valinux.com D: Linux/IA-64 S: 1382 Bordeaux Drive S: Sunnyvale, CA 94087 S: USA N: Bruno Ducrot E: ducrot@poupinou.org D: CPUFreq and ACPI bugfixes. 
S: Mougin, France N: Don Dugger E: n0ano@valinux.com D: Linux/IA-64 S: 1209 Pearl Street, #12 S: Boulder, CO 80302 S: USA N: Thomas Dunbar E: tdunbar@vt.edu D: TeX & METAFONT hacking/maintenance S: Virginia Tech Computing Center S: 1700 Pratt Drive S: Blacksburg, Virginia 24061 S: USA N: Randy Dunlap E: rdunlap@xenotime.net W: http://www.xenotime.net/linux/linux.html W: http://www.linux-usb.org D: Linux-USB subsystem, USB core/UHCI/printer/storage drivers D: x86 SMP, ACPI, bootflag hacking S: (ask for current address) S: USA N: Bob Dunlop E: rjd@xyzzy.clara.co.uk E: bob.dunlop@farsite.co.uk W: www.farsite.co.uk D: FarSync card device driver S: FarSite Communications Ltd S: Tempus Business Centre S: 60 Kingsclere Road S: Basingstoke RG21 6XG S: UK N: Cyrus Durgin E: cider@speakeasy.org W: http://www.speakeasy.org/~cider/ D: implemented kmod N: Torsten Duwe E: Torsten.Duwe@informatik.uni-erlangen.de D: Part-time kernel hacker D: The Linux Support Team Erlangen S: Grevenbroicher Str. 17 S: 47807 Krefeld S: Germany N: Tom Dyas E: tdyas@eden.rutgers.edu D: minor hacks and some sparc port stuff S: New Jersey S: USA N: Drew Eckhardt E: drew@PoohSticks.ORG D: SCSI code D: Assorted snippets elsewhere D: Boot sector "..." printing S: 2037 Walnut #6 S: Boulder, Colorado 80302 S: USA N: Heiko Eißfeldt E: heiko@colossus.escape.de heiko@unifix.de D: verify_area stuff, generic SCSI fixes D: SCSI Programming HOWTO D: POSIX.1 compliance testing S: Unifix Software GmbH S: Bueltenweg 27a S: D-38106 Braunschweig S: Germany N: Bjorn Ekwall E: bj0rn@blox.se W: http://www.pi.se/blox/ D: Extended support for loadable modules D: D-Link pocket adapter drivers S: Brevia 1043 S: S-114 79 Stockholm S: Sweden N: Pekka Enberg E: penberg@cs.helsinki.fi W: http://www.cs.helsinki.fi/u/penberg/ D: Various kernel hacks, fixes, and cleanups.
D: Slab allocators S: Finland N: David Engebretsen E: engebret@us.ibm.com D: Linux port to 64-bit PowerPC architecture N: Michael Engel E: engel@unix-ag.org D: DECstation framebuffer drivers S: Germany N: Paal-Kristian Engstad E: engstad@intermetrics.com D: Kernel smbfs (to mount WfW, NT and OS/2 network drives.) S: 17101 Springdale Street #225 S: Huntington Beach, California 92649 S: USA N: Stephane Eranian E: eranian@hpl.hp.com D: Linux/ia64 S: 1501 Page Mill Rd, MS 1U17 S: Palo Alto, CA 94304 S: USA N: Johannes Erdfelt E: johannes@erdfelt.com D: Linux/IA-64 bootloader and kernel goop, USB S: 6350 Stoneridge Mall Road S: Pleasanton, CA 94588 S: USA N: Doug Evans E: dje@cygnus.com D: Wrote Xenix FS (part of standard kernel since 0.99.15) N: Riccardo Facchetti E: fizban@tin.it P: 1024/6E657BB5 AF 22 90 33 78 76 04 8B AF F9 97 1E B5 E2 65 30 D: Audio Excel DSP 16 init driver author D: libmodem author D: Yet Another Micro Monitor port and current maintainer D: First ELF-HOWTO author D: random kernel hacker S: Via Paolo VI n.29 S: 23900 - LECCO (Lc) S: Italy N: Nils Faerber E: nils@kernelconcepts.de D: i810 TCO watchdog driver author D: Mitsumi LU005 tests and fixes D: port and fixes of cs46xx sounddriver S: Dreisbachstrasse 24 S: D-57250 Netphen S: Germany N: Rik Faith E: faith@acm.org D: Future Domain TMC-16x0 SCSI driver (author) D: APM driver (early port) D: DRM drivers (author of several) N: Janos Farkas E: chexum@shadow.banki.hu D: romfs, various (mostly networking) fixes P: 1024/F81FB2E1 41 B7 E4 E6 3E D4 A6 71 6D 9C F3 9F F2 BF DF 6E S: Madarasz Viktor utca 25 S: 1131 Budapest S: Hungary N: Ben Fennema E: bfennema@falcon.csc.calpoly.edu W: http://www.csc.calpoly.edu/~bfennema D: UDF filesystem S: (ask for current address) S: USA N: Jurgen Fischer E: fischer@norbit.de D: Author of Adaptec AHA-152x SCSI driver S: Schulstraße 18 S: 26506 Norden S: Germany N: Jeremy Fitzhardinge E: jeremy@goop.org W: http://www.goop.org/~jeremy D: author of userfs filesystem D:
Improved mmap and munmap handling D: General mm minor tidyups D: autofs v4 maintainer S: 987 Alabama St S: San Francisco S: CA, 94110 S: USA N: Ralf Flaxa E: rfflaxa@immd4.informatik.uni-erlangen.de D: The Linux Support Team Erlangen D: Creator of LST distribution D: Author of installation tool LISA S: Pfitznerweg 6 S: 74523 Schwaebisch Hall S: Germany N: Lawrence Foard E: entropy@world.std.com D: Floppy track reading, fs code S: 217 Park Avenue, Suite 108 S: Worcester, Massachusetts 01609 S: USA N: Karl Fogel E: kfogel@cs.oberlin.edu D: Contributor, Linux User's Guide S: 1123 North Oak Park Avenue S: Oak Park, Illinois 60302 S: USA N: Daniel J. Frasnelli E: dfrasnel@alphalinux.org W: http://www.alphalinux.org/ P: 1024/3EF87611 B9 F1 44 50 D3 E8 C2 80 DA E5 55 AA 56 7C 42 DA D: DEC Alpha hacker D: Miscellaneous bug squisher N: Jim Freeman E: jfree@sovereign.org W: http://www.sovereign.org/ D: Initial GPL'd Frame Relay driver D: Dynamic PPP devices D: Sundry modularizations (PPP, IPX, ...) and fixes N: Bob Frey E: bobf@advansys.com D: AdvanSys SCSI driver S: 1150 Ringwood Court S: San Jose, California 95131 S: USA N: Adam Fritzler E: mid@zigamorph.net N: Fernando Fuganti E: fuganti@conectiva.com.br E: fuganti@netbank.com.br D: random kernel hacker, ZF MachZ Watchdog driver S: Conectiva S.A. S: R. 
Tocantins, 89 - Cristo Rei S: 80050-430 - Curitiba - Parana S: Brazil N: Kumar Gala E: galak@kernel.crashing.org D: Embedded PowerPC 6xx/7xx/74xx/82xx/83xx/85xx support S: Austin, Texas 78729 S: USA N: Nigel Gamble E: nigel@nrg.org D: Interrupt-driven printer driver D: Preemptible kernel S: 120 Alley Way S: Mountain View, California 94040 S: USA N: Jeff Garzik E: jgarzik@pobox.com N: Jacques Gelinas E: jacques@solucorp.qc.ca D: Author of the Umsdos file system S: 1326 De Val-Brillant S: Laval, Quebec S: Canada H7Y 1V9 N: David Gentzel E: gentzel@telerama.lm.com D: Original BusLogic driver and original UltraStor driver S: Whitfield Software Services S: 600 North Bell Avenue, Suite 160 S: Carnegie, Pennsylvania 15106-4304 S: USA N: Kai Germaschewski E: kai@germaschewski.name D: Major kbuild rework during the 2.5 cycle D: ISDN Maintainer S: USA N: Philip Gladstone E: philip@gladstonefamily.net D: Kernel / timekeeping stuff S: Carlisle, MA 01741 S: USA N: Jan-Benedict Glaw E: jbglaw@lug-owl.de D: SRM environment driver (for Alpha systems) P: 1024D/8399E1BB 250D 3BCF 7127 0D8C A444 A961 1DBD 5E75 8399 E1BB N: Thomas Gleixner E: tglx@linutronix.de D: NAND flash hardware support, JFFS2 on NAND flash N: Richard E. Gooch E: rgooch@atnf.csiro.au D: parent process death signal to children D: prctl() syscall D: /proc/mtrr support to manipulate MTRRs on Intel P6 family D: Device FileSystem (devfs) S: CSIRO Australia Telescope National Facility S: P.O. Box 76, Epping S: New South Wales, 2121 S: Australia N: Carlos E. Gorges E: carlos@techlinux.com.br D: fix smp support on cmpci driver P: 2048G/EA3C4B19 FF31 33A6 0362 4915 B7EB E541 17D0 0379 EA3C 4B19 S: Brazil N: Dmitry S. Gorodchanin E: pgmdsg@ibi.com D: RISCom/8 driver, misc kernel fixes. S: 4 Main Street S: Woodbridge, Connecticut 06525 S: USA N: Paul Gortmaker E: p_gortmaker@yahoo.com D: Author of RTC driver & several net drivers, Ethernet & BootPrompt Howto. D: Made support for modules, ramdisk, generic-serial, etc. 
optional. D: Transformed old user space bdflush into 1st kernel thread - kflushd. D: Many other patches, documentation files, mini kernels, utilities, ... N: Masanori GOTO E: gotom@debian.or.jp D: Workbit NinjaSCSI-32Bi/UDE driver S: Japan N: John E. Gotts E: jgotts@linuxsavvy.com D: kernel hacker S: 8124 Constitution Apt. 7 S: Sterling Heights, Michigan 48313 S: USA N: Wolfgang Grandegger E: wg@grandegger.com D: Controller Area Network (device drivers) N: William Greathouse E: wgreathouse@smva.com E: wgreathouse@myfavoritei.com D: Current Belkin USB Serial Adapter F5U103 hacker D: Kernel hacker, embedded systems S: 7802 Fitzwater Road S: Brecksville, OH 44141-1334 S: USA N: Tristan Greaves E: tristan@extricate.org W: http://www.extricate.org/ D: Miscellaneous ipv4 sysctl patches N: Michael A. Griffith E: grif@cs.ucr.edu W: http://www.cs.ucr.edu/~grif D: Loopback speedup, qlogic SCSI hacking, VT_LOCKSWITCH S: Department of Computer Science S: University of California, Riverside S: Riverside, California 92521-0304 S: USA N: Hans Grobler E: grobh@sun.ac.za D: Various AX.25/ROSE/NETROM + hamradio driver patches D: Various X.25/LABP + driver patches D: Misc kernel fixes and updates S: Department of Electronic Engineering S: University of Stellenbosch S: Stellenbosch, Western Cape S: South Africa N: Grant Grundler E: grundler@parisc-linux.org W: http://obmouse.sourceforge.net/ W: http://www.parisc-linux.org/ D: obmouse - rewrote Olivier Florent's Omnibook 600 "pop-up" mouse driver D: PA-RISC - Interrupt/PCI HBA/IOMMU author and architect S: Mountain View, California S: USA N: Grant Guenther E: grant@torque.net W: http://www.torque.net/linux-pp.html D: original author of ppa driver for parallel port ZIP drive D: original architect of the parallel-port sharing scheme D: PARIDE subsystem: drivers for parallel port IDE & ATAPI devices S: 44 St. 
Joseph Street, Suite 506 S: Toronto, Ontario, M4Y 2W4 S: Canada N: Richard Gunther E: rguenth@tat.physik.uni-tuebingen.de W: http://www.tat.physik.uni-tuebingen.de/~rguenth P: 2048/2E829319 2F 83 FC 93 E9 E4 19 E2 93 7A 32 42 45 37 23 57 D: binfmt_misc S: 72074 Tubingen S: Germany N: Justin Guyett E: jguyett@andrew.cmu.edu D: via-rhine net driver hacking N: Danny ter Haar E: dth@cistron.nl D: /proc/cpuinfo, reboot on panic , kernel pre-patch tester ;) S: Cistron S: PO-Box 297 S: 2400 AG, Alphen aan den Rijn S: The Netherlands N: Enver Haase E: ehaase@inf.fu-berlin.de W: http://www.inf.fu-berlin.de/~ehaase D: Driver for the Commodore A2232 serial board N: Bruno Haible E: haible@ma2s2.mathematik.uni-karlsruhe.de D: SysV FS, shm swapping, memory management fixes S: 17 rue Danton S: F - 94270 Le Kremlin-Bicetre S: France N: Greg Hankins E: gregh@cc.gatech.edu D: fixed keyboard driver to separate LED and locking status S: 25360 Georgia Tech Station S: Atlanta, Georgia 30332 S: USA N: Brad Hards E: bradh@frogmouth.net D: Various USB bits, other minor patches N: Angelo Haritsis E: ah@computer.org D: kernel patches (serial, watchdog) D: xringd, vuzkern, greekXfonts S: 77 Clarence Mews S: London SE16 1GD S: United Kingdom N: Jan Harkes E: jaharkes@cs.cmu.edu W: http://www.coda.cs.cmu.edu/ D: Coda file system S: Computer Science Department S: Carnegie Mellon University S: 5000 Forbes Avenue S: Pittsburgh, Pennsylvania 15213 S: USA N: Kai Harrekilde-Petersen E: kai.harrekilde@get2net.dk D: Original author of the ftape-HOWTO, i82078 fdc detection code. 
N: Bart Hartgers E: bart@etpmod.phys.tue.nl D: MTRR emulation with Centaur MCRs S: Gen Stedmanstraat 212 S: 5623 HZ Eindhoven S: The Netherlands N: Oliver Hartkopp E: oliver.hartkopp@volkswagen.de W: http://www.volkswagen.de D: Controller Area Network (network layer core) S: Brieffach 1776 S: 38436 Wolfsburg S: Germany N: Andrew Haylett E: ajh@primag.co.uk D: Selection mechanism N: Andre Hedrick E: andre@linux-ide.org E: andre@linuxdiskcert.org W: http://www.linux-ide.org/ W: http://www.linuxdiskcert.org/ D: Random SMP kernel hacker... D: Uniform Multi-Platform E-IDE driver D: Active-ATA-Chipset madness.......... D: Ultra DMA 133/100/66/33 w/48-bit Addressing D: ATA-Disconnect, ATA-TCQ D: ATA-Smart Kernel Daemon D: Serial ATA D: ATA Command Block and Taskfile S: Linux ATA Development (LAD) S: Concord, CA N: Jochen Hein E: jochen@jochen.org P: 1024/4A27F015 25 72 FB E3 85 9F DE 3B CB 0A DA DA 40 77 05 6C P: 1024D/77D4FC9B F5C5 1C20 1DFC DEC3 3107 54A4 2332 ADFC 77D4 FC9B D: National Language Support D: Linux Internationalization Project D: German Localization for Linux and GNU software S: Auf der Fittel 18 S: 53347 Alfter S: Germany N: Christoph Hellwig E: hch@infradead.org D: all kinds of driver, filesystem & core kernel hacking D: freevxfs driver D: sysvfs maintainer D: chief codingstyle nitpicker S: Ampferstr. 50 / 4 S: 6020 Innsbruck S: Austria N: Richard Henderson E: rth@twiddle.net E: rth@cygnus.com D: Alpha hacker, kernel and userland S: 1668 California St. S: Mountain View, California 94041 S: USA N: Benjamin Herrenschmidt E: benh@kernel.crashing.org D: Various parts of PPC/PPC64 & PowerMac S: 312/107 Canberra Avenue S: Griffith, ACT 2603 S: Australia N: Sebastian Hetze E: she@lunetix.de D: German Linux Documentation, D: Organization of German Linux Conferences S: Danckelmannstr.
48 S: 14059 Berlin S: Germany N: David Hinds E: dahinds@users.sourceforge.net W: http://tao.stanford.edu/~dhinds D: PCMCIA and CardBus stuff, PCMCIA-HOWTO, PCMCIA client drivers S: 2019 W. Middlefield Rd #1 S: Mountain View, CA 94043 S: USA N: Michael Hipp E: hippm@informatik.uni-tuebingen.de D: drivers for the racal ni5210 & ni6510 Ethernet-boards S: Talstr. 1 S: D - 72072 Tuebingen S: Germany N: Richard Hirst E: richard@sleepie.demon.co.uk E: rhirst@linuxcare.com W: http://www.sleepie.demon.co.uk/ D: linux-m68k VME support D: PA-RISC port, scsi and network drivers D: 53c700/53c710 driver author, 82596 driver maintainer S: United Kingdom N: Jauder Ho E: jauderho@carumba.com W: http://www.carumba.com/ D: bug toaster (A1 sauce makes all the difference) D: Random linux hacker N: Tim Hockin E: thockin@hockin.org W: http://www.hockin.org/~thockin D: Natsemi ethernet D: Cobalt Networks (x86) support D: This-and-That N: Dirk Hohndel E: hohndel@suse.de D: The XFree86[tm] Project D: USB mouse maintainer S: SuSE Rhein/Main AG S: Mergenthalerallee 45-47 S: 65760 Eschborn S: Germany N: Kenji Hollis E: kenji@bitgate.com W: http://www.bitgate.com/ D: Berkshire PC Watchdog Driver D: Small/Industrial Driver Project N: Nick Holloway E: Nick.Holloway@pyrites.org.uk W: http://www.pyrites.org.uk/ P: 1024/36115A04 F4E1 3384 FCFD C055 15D6 BA4C AB03 FBF8 3611 5A04 D: Occasional Linux hacker... S: (ask for current address) S: United Kingdom N: Ron Holt E: ron@holt.org E: rholt@netcom.com W: http://www.holt.org/ W: http://www.ronholt.com/ D: Kernel development D: Kernel LDT modifications to support Wabi and Wine S: Holtron Internetics, Inc. 
S: 998 East 900 South, Suite 26 S: Provo, Utah 84606-5607 S: USA N: Marcel Holtmann E: marcel@holtmann.org W: http://www.holtmann.org D: Maintainer of the Linux Bluetooth Subsystem D: Author and maintainer of the various Bluetooth HCI drivers D: Author and maintainer of the CAPI message transport protocol driver D: Author and maintainer of the Bluetooth HID protocol driver D: Various other Bluetooth related patches, cleanups and fixes S: Germany N: Rob W. W. Hooft E: hooft@EMBL-Heidelberg.DE D: Shared libs for graphics-tools and for the f2c compiler D: Some kernel programming on the floppy and sound drivers in early days D: Some other hacks to get different kinds of programs to work for linux S: Panoramastrasse 18 S: D-69126 Heidelberg S: Germany N: Christopher Horn E: chorn@warwick.net D: Miscellaneous sysctl hacks S: 36 Mudtown Road S: Wantage, New Jersey 07461 S: USA N: Harald Hoyer E: harald.hoyer@parzelle.de W: http://parzelle.de/ D: ip_masq_quake D: md boot support S: Hohe Strasse 30 S: D-70176 Stuttgart S: Germany N: Jan Hubicka E: hubicka@freesoft.cz E: hubicka@suse.cz W: http://www.paru.cas.cz/~hubicka/ D: Random kernel tweaks and fixes. S: Dukelskych bojovniku 1944 S: Tabor 390 03 S: Czech Republic N: David Huggins-Daines E: dhd@debian.org E: dhd@eradicator.org E: dhd@cepstral.com D: PA-RISC port D: Nubus subsystem D: Generic 68k Macintosh framebuffer driver D: STI framebuffer tweaks D: LTPC driver tweaks S: 110 S. 12th St., Apt. 
A S: Pittsburgh, PA 15203-1250 S: USA N: Gareth Hughes E: gareth.hughes@acm.org D: Pentium III FXSR, SSE support D: Author/maintainer of most DRM drivers (especially ATI, MGA) D: Core DRM templates, general DRM and 3D-related hacking S: No fixed address N: Kenn Humborg E: kenn@wombat.ie D: Mods to loop device to support sparse backing files S: Ballinagard S: Roscommon S: Ireland N: Michael Hunold E: michael@mihu.de W: http://www.mihu.de/linux/ D: Generic saa7146 video4linux-2 driver core, D: Driver for the "Multimedia eXtension Board", "dpc7146", D: "Hexium Orion", "Hexium Gemini" N: Miguel de Icaza Amozurrutia E: miguel@nuclecu.unam.mx D: Linux/SPARC team, Midnight Commander maintainer S: Avenida Copilco 162, 22-1003 S: Mexico, DF S: Mexico N: Ian Jackson E: iwj10@cus.cam.ac.uk E: ijackson@nyx.cs.du.edu D: FAQ maintainer and poster of the daily postings D: FSSTND group member D: Debian core team member and maintainer of several Debian packages S: 2 Lexington Close S: Cambridge S: CB3 0DS S: United Kingdom N: Andreas Jaeger E: aj@suse.de D: Various smaller kernel fixes D: glibc developer S: Gottfried-Kinkel-Str. 18 S: D 67659 Kaiserslautern S: Germany N: Mike Jagdis E: jaggy@purplet.demon.co.uk E: Mike.Jagdis@purplet.demon.co.uk D: iBCS personalities, socket and X interfaces, x.out loader, syscalls... D: Purple Distribution maintainer D: UK FidoNet support D: ISODE && PP D: Kernel and device driver hacking S: 280 Silverdale Road S: Earley S: Reading S: RG6 2NU S: United Kingdom N: Jakub Jelinek E: jakub@redhat.com W: http://sunsite.mff.cuni.cz/~jj P: 1024/0F7623C5 53 95 71 3C EB 73 99 97 02 49 40 47 F9 19 68 20 D: Sparc hacker, SILO, mc D: Maintain sunsite.mff.cuni.cz S: K osmidomkum 723 S: 160 00 Praha 6 S: Czech Republic N: Niels Kristian Bech Jensen E: nkbj1970@hotmail.com D: Miscellaneous kernel updates and fixes. N: Michael K. 
Johnson E: johnsonm@redhat.com W: http://www.redhat.com/~johnsonm P: 1024/4536A8DD 2A EC 88 08 40 64 CE D8 DD F8 12 2B 61 43 83 15 D: The Linux Documentation Project D: Kernel Hackers' Guide D: Procps D: Proc filesystem D: Maintain tsx-11.mit.edu D: LP driver S: 201 Howell Street, Apartment 1C S: Chapel Hill, North Carolina 27514-4818 S: USA N: Dave Jones E: davej@redhat.com W: http://www.codemonkey.org.uk D: Assorted VIA x86 support. D: 2.5 AGPGART overhaul. D: CPUFREQ maintenance. D: Fedora kernel maintenance. D: Misc/Other. S: 314 Littleton Rd, Westford, MA 01886, USA N: Martin Josfsson E: gandalf@wlug.westbo.se P: 1024D/F6B6D3B1 7610 7CED 5C34 4AA6 DBA2 8BE1 5A6D AF95 F6B6 D3B1 D: netfilter: SAME target D: netfilter: helper target D: netfilter: various other hacks S: Ronneby S: Sweden N: Ani Joshi E: ajoshi@shell.unixbox.com D: fbdev hacking N: Jesper Juhl E: jj@chaosbits.net D: Various fixes, cleanups and minor features all over the tree. D: Wrote initial version of the hdaps driver (since passed on to others). S: Lemnosvej 1, 3.tv S: 2300 Copenhagen S. S: Denmark N: Jozsef Kadlecsik E: kadlec@blackhole.kfki.hu P: 1024D/470DB964 4CB3 1A05 713E 9BF7 FAC5 5809 DD8C B7B1 470D B964 D: netfilter: TCP window tracking code D: netfilter: raw table D: netfilter: iprange match D: netfilter: new logging interfaces D: netfilter: various other hacks S: Tata S: Hungary N: Bernhard Kaindl E: bkaindl@netway.at E: edv@bartelt.via.at D: Author of a menu based configuration tool, kmenu, which D: is the predecessor of 'make menuconfig' and 'make xconfig'. 
D: digiboard driver update(modularisation work and 2.1.x upd) S: Tallak 95 S: 8103 Rein S: Austria N: Mitsuru Kanda E: mk@linux-ipv6.org E: mk@isl.rdc.toshiba.co.jp E: mk@karaba.org W: http://www.karaba.org/~mk/ P: 1024D/2EC7E30D 4DC3 949B 5A6C F0D6 375F 4472 8888 A8E1 2EC7 E30D D: IPsec, IPv6 D: USAGI/WIDE Project, TOSHIBA CORPORATION S: 2-47-8, Takinogawa, S: Kita, Tokyo 114-0023 S: Japan N: Jan Kara E: jack@atrey.karlin.mff.cuni.cz E: jack@suse.cz D: Quota fixes for 2.2 kernel D: Quota fixes for 2.3 kernel D: Few other fixes in filesystem area (buffer cache, isofs, loopback) W: http://atrey.karlin.mff.cuni.cz/~jack/ S: Krosenska' 543 S: 181 00 Praha 8 S: Czech Republic N: Jan "Yenya" Kasprzak E: kas@fi.muni.cz D: Author of the COSA/SRP sync serial board driver. D: Port of the syncppp.c from the 2.0 to the 2.1 kernel. P: 1024/D3498839 0D 99 A7 FB 20 66 05 D7 8B 35 FC DE 05 B1 8A 5E W: http://www.fi.muni.cz/~kas/ S: c/o Faculty of Informatics, Masaryk University S: Botanicka' 68a S: 602 00 Brno S: Czech Republic N: Jakob Kemi E: jakob.kemi@telia.com D: V4L W9966 Webcam driver S: Forsbyvagen 33 S: 74143 Knivsta S: Sweden N: Fred N. van Kempen E: waltje@linux.com D: NET-2 D: Drivers D: Kernel cleanups S: Korte Heul 95 S: 1403 ND BUSSUM S: The Netherlands N: Karl Keyte E: karl@koft.com D: Disk usage statistics and modifications to line printer driver S: 26a Sheen Road S: Richmond S: Surrey S: TW9 1AE S: United Kingdom N: Marko Kiiskila E: marko@iprg.nokia.com D: Author of ATM Lan Emulation S: 660 Harvard Ave. #7 S: Santa Clara, CA 95051 S: USA N: Russell King E: rmk@arm.linux.org.uk D: Linux/arm integrator, maintainer & hacker D: Acornfb, Cyber2000fb author S: Burgh Heath, Tadworth, Surrey. S: England N: Olaf Kirch E: okir@monad.swb.de D: Author of the Linux Network Administrators' Guide S: Kattreinstr 38 S: D-64295 S: Germany N: Andi Kleen E: andi@firstfloor.org U: http://www.halobates.de D: network, x86, NUMA, various hacks S: Schwalbenstr. 
96 S: 85551 Ottobrunn S: Germany N: Ian Kluft E: ikluft@thunder.sbay.org W: http://www.kluft.com/~ikluft/ D: NET-1 beta testing & minor patches, original Smail binary packages for D: Slackware and Debian, vote-taker for 2nd comp.os.linux reorganization S: Post Office Box 611311 S: San Jose, California 95161-1311 S: USA N: Thorsten Knabe E: Thorsten Knabe E: Thorsten Knabe W: http://www.student.informatik.tu-darmstadt.de/~tek W: http://www.tu-darmstadt.de/~tek01 P: 1024/3BC8D885 8C 29 C5 0A C0 D1 D6 F4 20 D4 2D AB 29 F6 D0 60 D: AD1816 sound driver S: Am Bergfried 10 S: 63225 Langen S: Germany N: Alain L. Knaff E: Alain.Knaff@lll.lu D: floppy driver S: 19, rue Jean l'Aveugle S: L-1148 Luxembourg-City S: Luxembourg N: Gerd Knorr W: http://bytesex.org E: kraxel@bytesex.org E: kraxel@suse.de D: video4linux, bttv, vesafb, some scsi, misc fixes N: Harald Koenig E: koenig@tat.physik.uni-tuebingen.de D: XFree86 (S3), DCF77, some kernel hacks and fixes S: Koenigsberger Str. 90 S: D-72336 Balingen S: Germany N: Rudolf Koenig E: rfkoenig@immd4.informatik.uni-erlangen.de D: The Linux Support Team Erlangen N: Andreas Koensgen E: ajk@comnets.uni-bremen.de D: 6pack driver for AX.25 N: Harald Koerfgen E: hkoerfg@web.de D: Linux/MIPS kernel hacks and fixes, D: DECstation port, Sharp Mobilon port S: D-50931 Koeln S: Germany N: Willy Konynenberg E: willy@xos.nl W: http://www.xos.nl/ D: IP transparent proxy support S: X/OS Experts in Open Systems BV S: Kruislaan 419 S: 1098 VA Amsterdam S: The Netherlands N: Goran Koruga E: korugag@siol.net D: cpufrequtils (precursor to cpupowerutils) S: Slovenia N: Jiri Kosina E: jikos@jikos.cz E: jkosina@suse.cz D: Generic HID layer - original code split, fixes D: Various ACPI fixes, keeping correct battery state through suspend D: various lockdep annotations, autofs and other random bugfixes S: Prague, Czech Republic N: Gene Kozin E: 74604.152@compuserve.com W: http://www.sangoma.com D: WAN Router & Sangoma WAN drivers S: Sangoma Technologies Inc. 
S: 7170 Warden Avenue, Unit 2 S: Markham, Ontario S: L3R 8B2 S: Canada N: Maxim Krasnyansky E: maxk@qualcomm.com W: http://vtun.sf.net W: http://bluez.sf.net D: Author of the Universal TUN/TAP driver D: Author of the Linux Bluetooth Subsystem (BlueZ) D: Various other kernel patches, cleanups and fixes S: 2213 La Terrace Circle S: San Jose, CA 95123 S: USA N: Andreas S. Krebs E: akrebs@altavista.net D: CYPRESS CY82C693 chipset IDE, Digital's PC-Alpha 164SX boards N: Greg Kroah-Hartman E: greg@kroah.com E: gregkh@suse.de W: http://www.kroah.com/linux/ D: USB Serial Converter driver framework, USB Handspring Visor driver D: ConnectTech WHITEHeat USB driver, Generic USB Serial driver D: USB I/O Edgeport driver, USB Serial IrDA driver D: USB Bluetooth driver, USB Skeleton driver D: bits and pieces of USB core code. D: PCI Hotplug core, PCI Hotplug Compaq driver modifications D: portions of the Linux Security Module (LSM) framework D: parts of the driver core, debugfs. N: Russell Kroll E: rkroll@exploits.org W: http://www.exploits.org/ D: V4L radio cards: radio-aztech (new), others (bugfixes/features) D: Loopback block device: dynamic sizing ("max_loop" as module) S: Post Office Box 691886 S: San Antonio, Texas 78269-1886 S: USA N: Denis O. Kropp E: dok@directfb.org D: NeoMagic framebuffer driver S: Badensche Str. 46 S: 10715 Berlin S: Germany N: Andrzej M. Krzysztofowicz E: ankry@mif.pg.gda.pl D: Some 8-bit XT disk driver and devfs hacking D: Aladdin 1533/1543(C) chipset IDE D: PIIX chipset IDE S: ul. 
Matemblewska 1B/10 S: 80-283 Gdansk S: Poland N: Gero Kuhlmann E: gero@gkminix.han.de D: mounting root via NFS S: Donarweg 4 S: D-30657 Hannover S: Germany N: Markus Kuhn E: mskuhn@cip.informatik.uni-erlangen.de W: http://wwwcip.informatik.uni-erlangen.de/user/mskuhn D: Unicode, real-time, time, standards S: Schlehenweg 9 S: D-91080 Uttenreuth S: Germany N: Jaya Kumar E: jayalk@intworks.biz W: http://www.intworks.biz D: Arc monochrome LCD framebuffer driver, x86 reboot fixups D: pirq addr, CS5535 alsa audio driver S: Gurgaon, India S: Kuala Lumpur, Malaysia N: Gabor Kuti M: seasons@falcon.sch.bme.hu M: seasons@makosteszta.sote.hu D: Original author of software suspend N: Jaroslav Kysela E: perex@perex.cz W: http://www.perex.cz D: Original Author and Maintainer for HP 10/100 Mbit Network Adapters D: ISA PnP S: Sindlovy Dvory 117 S: 370 01 Ceske Budejovice S: Czech Republic N: Bas Laarhoven E: sjml@xs4all.nl D: Loadable modules and ftape driver S: J. Obrechtstr 23 S: NL-5216 GP 's-Hertogenbosch S: The Netherlands N: Savio Lam E: lam836@cs.cuhk.hk D: Author of the dialog utility, foundation D: for Menuconfig's lxdialog. N: Christoph Lameter E: christoph@lameter.com D: Digiboard PC/Xe and PC/Xi, Digiboard EPCA D: NUMA support, Slab allocators, Page migration D: Scalability, Time subsystem N: Paul Laufer E: paul@laufernet.com D: Soundblaster driver fixes, ISAPnP quirk S: California, USA N: Jonathan Layes D: ARPD support N: Tom Lees E: tom@lpsg.demon.co.uk W: http://www.lpsg.demon.co.uk/ P: 1024/87D4D065 2A 66 86 9D 02 4D A6 1E B8 A2 17 9D 4F 9B 89 D6 D: Original author and current maintainer of D: PnP code. N: David van Leeuwen E: david@tm.tno.nl D: Philips/LMS cm206 cdrom driver, generic cdrom driver S: Scheltemalaan 14 S: 3817 KS Amersfoort S: The Netherlands N: Volker Lendecke E: vl@kki.org D: Kernel smbfs (to mount WfW, NT and OS/2 network drives.) D: NCP filesystem support (to mount NetWare volumes) S: Von-Ossietzky-Str. 
12 S: 37085 Gottingen S: Germany N: Kevin Lentin E: kevinl@cs.monash.edu.au D: NCR53C400/T130B SCSI extension to NCR5380 driver. S: 18 Board Street S: Doncaster VIC 3108 S: Australia N: Hans Lermen E: lermen@elserv.ffm.fgan.de D: Author of the LOADLIN Linux loader, hacking on boot stuff D: Coordinator of DOSEMU releases S: Am Muehlenweg 38 S: D53424 Remagen S: Germany N: Colin Leroy E: colin@colino.net W: http://www.geekounet.org/ D: PowerMac adt746x fan driver D: Random fixing of various drivers (macintosh, usb, sound) S: Toulouse S: France N: Achim Leubner E: achim_leubner@adaptec.com D: GDT Disk Array Controller/Storage RAID controller driver S: ICP vortex GmbH S: Neckarsulm S: Germany N: Phil Lewis E: beans@bucket.ualr.edu D: Promised to send money if I would put his name in the source tree. S: Post Office Box 371 S: North Little Rock, Arkansas 72115 S: USA N: Stephan Linz E: linz@mazet.de E: Stephan.Linz@gmx.de W: http://www.crosswinds.net/~tuxer D: PCILynx patch to work with 1394a PHY and without local RAM S: (ask for current address) S: Germany N: Christophe Lizzi E: lizzi@cnam.fr W: http://cedric.cnam.fr/personne/lizzi D: FORE Systems 200E-series ATM network driver, sparc64 port of ATM S: CNAM, Laboratoire CEDRIC S: 292, rue St-Martin S: 75141 Paris Cedex 03 S: France N: Siegfried "Frieder" Loeffler (dg1sek) E: floeff@tunix.mathematik.uni-stuttgart.de, fl@LF.net W: http://www.mathematik.uni-stuttgart.de/~floeff D: Busmaster driver for HP 10/100 Mbit Network Adapters S: University of Stuttgart, Germany and S: Ecole Nationale Superieure des Telecommunications, Paris S: France N: Jamie Lokier E: jamie@shareable.org W: http://www.shareable.org/ D: Reboot-through-BIOS for broken 486 motherboards D: Parport fixes, futex improvements D: First instruction of x86 sysenter path :) S: 51 Sunningwell Road S: Oxford S: OX1 4SZ S: United Kingdom N: Mark Lord E: mlord@pobox.com D: EIDE driver, hd.c support D: EIDE PCI and bus-master DMA support D: Hard Disk Parameter 
(hdparm) utility S: 33 Ridgefield Cr S: Nepean, Ontario S: Canada K2H 6S3 N: Warner Losh E: imp@village.org D: Linux/MIPS Deskstation support, Provided OI/OB for Linux S: 8786 Niwot Road S: Niwot, Colorado 80503 S: USA N: Robert M. Love E: rml@tech9.net E: rml@novell.com D: misc. kernel hacking and debugging S: Cambridge, MA 02139 S: USA N: Martin von Lowis E: loewis@informatik.hu-berlin.de D: script binary format D: NTFS driver N: H.J. Lu E: hjl@gnu.ai.mit.edu D: GCC + libraries hacker N: Yanir Lubetkin E: yanirx.lubatkin@intel.com E: linux-wimax@intel.com D: Intel Wireless WiMAX Connection 2400 driver N: Michal Ludvig E: michal@logix.cz E: michal.ludvig@asterisk.co.nz W: http://www.logix.cz/michal P: 1024D/C45B2218 1162 6471 D391 76E0 9F99 29DA 0C3A 2509 C45B 2218 D: VIA PadLock driver D: Netfilter pkttype module S: Asterisk Ltd. S: Auckland S: New Zealand N: Tuomas J. Lukka E: Tuomas.Lukka@Helsinki.FI D: Original dual-monitor patches D: Console-mouse-tracking patches S: Puistokaari 1 E 18 S: 00200 Helsinki S: Finland N: Daniel J. Maas E: dmaas@dcine.com W: http://www.maasdigital.com D: dv1394 N: Hamish Macdonald E: hamishm@lucent.com D: Linux/68k port S: 32 Clydesdale Avenue S: Kanata, Ontario S: Canada K2M-2G7 N: Peter MacDonald D: SLS distribution D: Initial implementation of VC's, pty's and select() N: Pavel Machek E: pavel@ucw.cz D: Softcursor for vga, hypertech cdrom support, vcsa bugfix, nbd D: sun4/330 port, capabilities for elf, speedup for rm on ext2, USB, D: work on suspend-to-ram/disk, killing duplicates from ioctl32 S: Volkova 1131 S: 198 00 Praha 9 S: Czech Republic N: Paul Mackerras E: paulus@samba.org D: PPP driver D: Linux for PowerPC D: Linux port for PCI Power Macintosh N: Pat Mackinlay E: pat@it.com.au D: 8 bit XT hard disk driver D: Miscellaneous ST0x, TMC-8xx and other SCSI hacking S: 25 McMillan Street S: Victoria Park 6100 S: Australia N: James B. 
MacLean E: macleajb@ednet.ns.ca W: http://www.ednet.ns.ca/~macleajb/dosemu.html D: Former Coordinator of DOSEMU releases D: Program in DOSEMU S: PO BOX 220, HFX. CENTRAL S: Halifax, Nova Scotia S: Canada B3J 3C8 N: Kai Makisara E: Kai.Makisara@kolumbus.fi D: SCSI Tape Driver N: Asit Mallick E: asit.k.mallick@intel.com D: Linux/IA-64 S: 2200 Mission College Blvd S: Santa Clara, CA 95052 S: USA N: Petko Manolov E: petkan@users.sourceforge.net D: USB ethernet pegasus/pegasus-II driver D: USB ethernet rtl8150 driver D: optimizing i[45]86 string routines D: i386 task switching hacks S: 482 Shadowgraph Dr. S: San Jose, CA 95110 S: USA N: Martin Mares E: mj@ucw.cz W: http://www.ucw.cz/~mj/ D: BIOS video mode handling code D: MOXA C-218 serial board driver D: Network autoconfiguration D: PCI subsystem D: Random kernel hacking S: Kankovskeho 1241 S: 182 00 Praha 8 S: Czech Republic N: John A. Martin E: jam@acm.org W: http://www.tux.org/~jam/ P: 1024/04456D53 9D A3 6C 6B 88 80 8A 61 D7 06 22 4F 95 40 CE D2 P: 1024/3B986635 5A61 7EE6 9E20 51FB 59FB 2DA5 3E18 DD55 3B98 6635 D: FSSTND contributor D: Credit file compilator N: Kevin E. Martin E: martin@cs.unc.edu D: Developed original accelerated X servers included in XFree86 D: XF86_Mach64 D: XF86_Mach32 D: XF86_Mach8 D: XF86_8514 D: cfdisk (curses based disk partitioning program) N: John S. Marvin E: jsm@fc.hp.com D: PA-RISC port S: Hewlett Packard S: MS 42 S: 3404 E. Harmony Road S: Fort Collins, CO 80528 S: USA N: Torben Mathiasen E: torben.mathiasen@compaq.com E: torben@kernel.dk W: http://tlan.kernel.dk D: ThunderLAN maintainer D: ThunderLAN updates and other kernel fixes. S: Bremensgade 29, st.th S: 2300 Copenhagen S S: Denmark N: Claudio S. Matsuoka E: cmatsuoka@gmail.com E: claudio@mandriva.com W: http://helllabs.org/~claudio D: V4L, OV511 and HDA-codec hacks S: Conectiva S.A. 
S: Souza Naves 1250 S: 80050-040 Curitiba PR S: Brazil N: Heinz Mauelshagen E: mge@EZ-Darmstadt.Telekom.de D: Logical Volume Manager S: Bartningstr. 12 S: 64289 Darmstadt S: Germany N: Mark W. McClelland E: mmcclell@bigfoot.com E: mark@alpha.dyndns.org W: http://alpha.dyndns.org/ov511/ P: 1024D/357375CC 317C 58AC 1B39 2AB0 AB96 EB38 0B6F 731F 3573 75CC D: OV511 driver S: (address available on request) S: USA N: Ian McDonald E: ian.mcdonald@jandi.co.nz E: imcdnzl@gmail.com W: http://wand.net.nz/~iam4 W: http://imcdnzl.blogspot.com D: DCCP, CCID3 S: Hamilton S: New Zealand N: Patrick McHardy E: kaber@trash.net P: 1024D/12155E80 B128 7DE6 FF0A C2B2 48BE AB4C C9D4 964E 1215 5E80 D: netfilter: endless number of bugfixes D: netfilter: CLASSIFY target D: netfilter: addrtype match D: tc: HFSC scheduler S: Freiburg S: Germany N: Paul E. McKenney E: paulmck@us.ibm.com W: http://www.rdrop.com/users/paulmck/ D: RCU and variants D: rcutorture module N: Mike McLagan E: mike.mclagan@linux.org W: http://www.invlogic.com/~mmclagan D: DLCI/FRAD drivers for Sangoma SDLAs S: Innovative Logic Corp S: Post Office Box 1068 S: Laurel, Maryland 20732 S: USA N: Bradley McLean E: brad@bradpc.gaylord.com D: Device driver hacker D: General kernel debugger S: 249 Nichols Avenue S: Syracuse, New York 13206 S: USA N: Kyle McMartin E: kyle@parisc-linux.org D: Linux/PARISC hacker D: AD1889 sound driver S: Ottawa, Canada N: Dirk Melchers E: dirk@merlin.nbg.sub.org D: 8 bit XT hard disk driver for OMTI5520 S: Schloessleinsgasse 31 S: D-90453 Nuernberg S: Germany N: Arnaldo Carvalho de Melo E: acme@ghostprotocols.net E: arnaldo.melo@gmail.com E: acme@redhat.com W: http://oops.ghostprotocols.net:81/blog/ P: 1024D/9224DF01 D5DF E3BB E3C8 BCBB F8AD 841A B6AB 4681 9224 DF01 D: IPX, LLC, DCCP, cyc2x, wl3501_cs, net/ hacks S: Brazil N: Karsten Merker E: merker@linuxtag.org D: DECstation framebuffer drivers S: Germany N: Michael Meskes E: meskes@debian.org P: 1024/04B6E8F5 6C 77 33 CA CC D6 22 03 AB AB 15 A3 
AE AD 39 7D D: Kernel hacker. PostgreSQL hacker. Software watchdog daemon. D: Maintainer of several Debian packages S: Th.-Heuss-Str. 61 S: D-41812 Erkelenz S: Germany N: Nigel Metheringham E: Nigel.Metheringham@ThePLAnet.net P: 1024/31455639 B7 99 BD B8 00 17 BD 46 C1 15 B8 AB 87 BC 25 FA D: IP Masquerading work and minor fixes S: Planet Online S: The White House, Melbourne Street, LEEDS S: LS2 7PS, United Kingdom N: Craig Metz E: cmetz@inner.net D: Some of PAS 16 mixer & PCM support, inet6-apps N: William (Bill) Metzenthen E: billm@suburbia.net D: Author of the FPU emulator. D: Minor kernel hacker for other lost causes (Hercules mono, etc). S: 22 Parker Street S: Ormond S: Victoria 3163 S: Australia N: Pauline Middelink E: middelin@polyware.nl D: General low-level bug fixes, /proc fixes, identd support D: Author of IP masquerading D: Zoran ZR36120 Video For Linux driver S: Boterkorfhoek 34 S: 7546 JA Enschede S: Netherlands N: David S. Miller E: davem@davemloft.net D: Sparc and blue box hacker D: Vger Linux mailing list co-maintainer D: Linux Emacs elf/qmagic support + other libc/gcc things D: Yee bore de yee bore! ;-) S: 575 Harrison St. #103 S: San Francisco, CA 94105 S: USA N: Rick Miller E: rdmiller@execpc.com W: http://www.execpc.com/~rdmiller/ D: Original Linux Device Registrar (Major/minor numbers) D: au-play, bwBASIC S: S78 W16203 Woods Road S: Muskego, Wisconsin 53150 S: USA N: Harald Milz E: hm@seneca.linux.de D: Linux Projects Map, Linux Commercial-HOWTO D: general Linux publicity in Germany, vacation port D: UUCP and CNEWS binary packages for LST S: Editorial Board iX Mag S: Helstorfer Str. 
7 S: D-30625 Hannover S: Germany N: Corey Minyard E: minyard@wf-rch.cirr.com E: minyard@mvista.com W: http://home.attbi.com/~minyard D: Sony CDU31A CDROM Driver D: IPMI driver D: Various networking fixes long ago D: Original ppc_md work D: Shared zlib S: 7406 Wheat Field Rd S: Garland, Texas 75044 S: USA N: Kazunori Miyazawa E: miyazawa@linux-ipv6.org E: Kazunori.Miyazawa@jp.yokogawa.com E: kazunori@miyazawa.org W: http://www.miyazawa.org/~kazunori/ D: IPsec, IPv6 D: USAGI/WIDE Project, Yokogawa Electric Corporation S: 2-20-4-203, Nakacho, S: Musashino, Tokyo 180-0006 S: Japan N: Patrick Mochel E: mochel@osdl.org E: mochelp@infinity.powertie.org D: PCI Power Management, ACPI work S: 12725 SW Millikan Way, Suite 400 S: Beaverton, Oregon 97005 S: USA N: Eberhard Monkeberg E: emoenke@gwdg.de D: CDROM driver "sbpcd" (Matsushita/Panasonic/Soundblaster) S: Ruhstrathohe 2 b. S: D-37085 Gottingen S: Germany N: Thomas Molina E: tmolina@cablespeed.com D: bug fixes, documentation, minor hackery N: Paul Moore E: paul.moore@hp.com D: NetLabel author S: Hewlett-Packard S: 110 Spit Brook Road S: Nashua, NH 03062 N: James Morris E: jmorris@namei.org W: http://namei.org/ D: Netfilter, Linux Security Modules (LSM), SELinux, IPSec, D: Crypto API, general networking, miscellaneous. S: PO Box 707 S: Spit Junction NSW 2088 S: Australia N: David Mosberger-Tang E: davidm@hpl.hp.com if IA-64 related, else David.Mosberger@acm.org D: Linux/Alpha and Linux/ia64 S: 35706 Runckel Lane S: Fremont, California 94536 S: USA N: Sam Mosel E: sam.mosel@computer.org D: Wacom Intuos USB Support S: 22 Seaview St S: Fullarton 5063 S: South Australia N: Wolfgang Muees E: wolfgang@iksw-muees.de D: Auerswald USB driver N: Ian A.
Murdock E: imurdock@gnu.ai.mit.edu D: Creator of Debian distribution S: 30 White Tail Lane S: Lafayette, Indiana 47905 S: USA N: Scott Murray E: scottm@somanetworks.com E: scott@spiteful.org D: OPL3-SA2, OPL3-SA3 sound driver D: CompactPCI hotplug core D: Ziatech ZT5550 and generic CompactPCI hotplug drivers S: Toronto, Ontario S: Canada N: Zwane Mwaikambo E: zwane@arm.linux.org.uk D: Various driver hacking D: Lowlevel x86 kernel hacking D: General debugging S: (ask for current address) S: Tanzania N: Trond Myklebust E: trond.myklebust@fys.uio.no D: current NFS client hacker. S: Dagaliveien 31e S: N-0391 Oslo S: Norway N: Johan Myreen E: jem@iki.fi D: PS/2 mouse driver writer etc. S: Dragonvagen 1 A 13 S: FIN-00330 Helsingfors S: Finland N: Matija Nalis E: mnalis@jagor.srce.hr E: mnalis@voyager.hr D: Maintainer of the Umsdos file system S: Listopadska 7 S: 10000 Zagreb S: Croatia N: Jonathan Naylor E: g4klx@g4klx.demon.co.uk E: g4klx@amsat.org W: http://zone.pspt.fi/~jsn/ D: AX.25, NET/ROM and ROSE amateur radio protocol suites D: CCITT X.25 PLP and LAPB. S: 24 Castle View Drive S: Cromford S: Matlock S: Derbyshire DE4 3RL S: United Kingdom N: Ian S. Nelson E: nelsonis@earthlink.net P: 1024D/00D3D983 3EFD 7B86 B888 D7E2 29B6 9E97 576F 1B97 00D3 D983 D: Minor mmap and ide hacks S: 1370 Atlantis Ave. S: Lafayette CO, 80026 S: USA N: Russell Nelson E: nelson@crynwr.com W: http://www.crynwr.com/~nelson P: 1024/83942741 FF 68 EE 27 A0 5A AA C3 F5 DC 05 62 BD 5B 20 2F D: Author of cs89x0, maintainer of kernel changelog through 1.3.3 D: Wrote many packet drivers, from which some Ethernet drivers are derived. 
S: 521 Pleasant Valley Road S: Potsdam, New York 13676 S: USA N: Dave Neuer E: dave.neuer@pobox.com D: Helped implement support for Compaq's H31xx series iPAQs D: Other mostly minor tweaks & bugfixes N: Michael Neuffer E: mike@i-Connect.Net E: neuffer@goofy.zdv.uni-mainz.de W: http://www.i-Connect.Net/~mike/ D: Developer and maintainer of the EATA-DMA SCSI driver D: Co-developer EATA-PIO SCSI driver D: /proc/scsi and assorted other snippets S: Zum Schiersteiner Grund 2 S: 55127 Mainz S: Germany N: Gustavo Niemeyer E: niemeyer@conectiva.com W: https://moin.conectiva.com.br/GustavoNiemeyer D: wl3501 PCMCIA wireless card initial support for wireless extensions in 2.4 S: Conectiva S.A. S: R. Tocantins 89 S: 80050-430 Curitiba PR S: Brazil N: David C. Niemi E: niemi@tux.org W: http://www.tux.org/~niemi/ D: Assistant maintainer of Mtools, fdutils, and floppy driver D: Administrator of Tux.Org Linux Server, http://www.tux.org S: 2364 Old Trail Drive S: Reston, Virginia 20191 S: USA N: Fredrik Noring E: noring@nocrew.org W: http://www.lysator.liu.se/~noring/ D: dsp56k device driver N: Michael O'Reilly E: michael@iinet.com.au E: oreillym@tartarus.uwa.edu.au D: Wrote the original dynamic sized disk cache stuff. I think the only D: part that remains is the GFP_KERNEL et al #defines. :) S: 192 Nicholson Road S: Subiaco, 6008 S: Perth, Western Australia S: Australia N: Miguel Ojeda Sandonis E: miguel.ojeda.sandonis@gmail.com W: http://miguelojeda.es W: http://jair.lab.fi.uva.es/~migojed/ D: Author of the ks0108, cfag12864b and cfag12864bfb auxiliary display drivers. D: Maintainer of the auxiliary display drivers tree (drivers/auxdisplay/*) S: C/ Mieses 20, 9-B S: Valladolid 47009 S: Spain N: Gadi Oxman E: gadio@netvision.net.il D: Original author and maintainer of IDE/ATAPI floppy/tape drivers N: Greg Page E: gpage@sovereign.org D: IPX development and support N: David Parsons E: orc@pell.chi.il.us D: improved memory detection code.
N: Ivan Passos E: ivan@cyclades.com D: Author of the Cyclades-PC300 synchronous card driver D: Maintainer of the Cyclom-Y/Cyclades-Z asynchronous card driver S: Cyclades Corp S: 41934 Christy St S: Fremont, CA 94538 S: USA N: Mikulas Patocka E: mikulas@artax.karlin.mff.cuni.cz W: http://artax.karlin.mff.cuni.cz/~mikulas/ P: 1024/BB11D2D5 A0 F1 28 4A C4 14 1E CF 92 58 7A 8F 69 BC A4 D3 D: Read/write HPFS filesystem S: Weissova 8 S: 644 00 Brno S: Czech Republic N: Vojtech Pavlik E: vojtech@suse.cz D: Joystick driver D: arcnet-hardware readme D: Minor ARCnet hacking D: USB (HID, ACM, Printer ...) S: Ucitelska 1576 S: Prague 8 S: 182 00 Czech Republic N: Rick Payne D: RFC2385 Support for TCP N: Barak A. Pearlmutter E: bap@cs.unm.edu W: http://www.cs.unm.edu/~bap/ P: 512/602D785D 9B A1 83 CD EE CB AD 93 20 C6 4C B7 F5 E9 60 D4 D: Author of mark-and-sweep GC integrated by Alan Cox S: Computer Science Department S: FEC 313 S: University of New Mexico S: Albuquerque, New Mexico 87131 S: USA N: Avery Pennarun E: apenwarr@worldvisions.ca W: http://www.worldvisions.ca/~apenwarr/ D: ARCnet driver D: "make xconfig" improvements D: Various minor hacking S: RR #5, 497 Pole Line Road S: Thunder Bay, Ontario S: CANADA P7C 5M9 N: Inaky Perez-Gonzalez E: inaky.perez-gonzalez@intel.com E: linux-wimax@intel.com E: inakypg@yahoo.com D: WiMAX stack D: Intel Wireless WiMAX Connection 2400 driver N: Yuri Per E: yuri@pts.mipt.ru D: Some smbfs fixes S: Demonstratsii 8-382 S: Tula 300000 S: Russia N: Inaky Perez-Gonzalez E: inaky.perez-gonzalez@intel.com D: UWB stack, HWA-RC driver and HWA-HC drivers D: Wireless USB additions to the USB stack D: WiMedia Link Protocol bits and pieces N: Gordon Peters E: GordPeters@smarttech.com D: Isochronous receive for IEEE 1394 driver (OHCI module). D: Bugfixes for the aforementioned. S: Calgary, Alberta S: Canada N: Johnnie Peters E: jpeters@phx.mcd.mot.com D: Motorola PowerPC changes for PReP S: 2900 S. 
Diable Way S: Tempe, Arizona 85282 S: USA N: Kirk Petersen E: kirk@speakeasy.org W: http://www.speakeasy.org/~kirk/ D: implemented kmod D: modularized BSD Unix domain sockets N: Martin Kasper Petersen E: mkp@mkp.net D: PA-RISC port D: XFS file system D: kiobuf based block I/O work S: 314 Frank St. S: Ottawa, Ontario S: Canada K2P 0X8 N: Mikael Pettersson E: mikpe@it.uu.se W: http://user.it.uu.se/~mikpe/linux/ D: Miscellaneous fixes N: Reed H. Petty E: rhp@draper.net W: http://www.draper.net D: Loop device driver extensions D: Encryption transfer modules (no export) S: Post Office Box 1815 S: Harrison, Arkansas 72602-1815 S: USA N: Kai Petzke E: petzke@teltarif.de W: http://www.teltarif.de/ P: 1024/B42868C1 D9 59 B9 98 BB 93 05 38 2E 3E 31 79 C3 65 5D E1 D: Driver for Laser Magnetic Storage CD-ROM D: Some kernel bug fixes D: Port of the database Postgres D: Book: "Linux verstehen und anwenden" (Hanser-Verlag) S: Triftstraße 55 S: 13353 Berlin S: Germany N: Emanuel Pirker E: epirker@edu.uni-klu.ac.at D: AIC5800 IEEE 1394, RAW I/O on 1394 D: Starter of Linux1394 effort S: ask per mail for current address N: Nicolas Pitre E: nico@fluxnic.net D: StrongARM SA1100 support integrator & hacker D: Xscale PXA architecture D: unified SMC 91C9x/91C11x ethernet driver (smc91x) S: Montreal, Quebec, Canada N: Ken Pizzini E: ken@halcyon.com D: CDROM driver "sonycd535" (Sony CDU-535/531) N: Stelian Pop E: stelian@popies.net P: 1024D/EDBB6147 7B36 0E07 04BC 11DC A7A0 D3F7 7185 9E7A EDBB 6147 D: random kernel hacks S: Paimpont, France N: Pete Popov E: pete_popov@yahoo.com D: Linux/MIPS AMD/Alchemy Port and mips hacking and debugging S: San Jose, CA 95134 S: USA N: Matt Porter E: mporter@kernel.crashing.org D: Motorola PowerPC PReP support D: cPCI PowerPC support D: Embedded PowerPC 4xx/6xx/7xx/74xx support S: Chandler, Arizona 85249 S: USA N: Frederic Potter E: fpotter@cirpack.com D: Some PCI kernel support N: Rui Prior E: rprior@inescn.pt D: ATM device driver for NICStAR based
cards N: Stefan Probst E: sp@caldera.de D: The Linux Support Team Erlangen, 1993-97 S: Caldera (Deutschland) GmbH S: Lazarettstrasse 8 S: 91054 Erlangen S: Germany N: Giuliano Procida E: myxie@debian.org,gprocida@madge.com D: Madge Ambassador driver (Collage 155 Server ATM adapter) D: Madge Horizon driver (Collage 25 and 155 Client ATM adapters) P: 1024/93898735 D3 9E F4 F7 6D 8D 2F 3A 38 BA 06 7C 2B 33 43 7D S: Madge Networks S: Framewood Road S: Wexham SL3 6PJ S: United Kingdom N: Daniel Quinlan E: quinlan@pathname.com W: http://www.pathname.com/~quinlan/ D: FSSTND coordinator; FHS editor D: random Linux documentation, patches, and hacks S: 4390 Albany Drive #41A S: San Jose, California 95129 S: USA N: Juan Quintela E: quintela@fi.udc.es D: Memory Management hacking S: LFCIA S: Departamento de Computacion S: Universidade da Coruna S: E-15071 S: A Coruna S: Spain N: Augusto Cesar Radtke E: bishop@sekure.org W: http://bishop.sekure.org D: {copy,get,put}_user calls updates D: Miscellaneous hacks S: R. Otto Marquardt, 226 - Garcia S: 89020-350 Blumenau - Santa Catarina S: Brazil N: Goutham Rao E: goutham.rao@intel.com D: Linux/IA-64 S: 2200 Mission College Blvd S: Santa Clara, CA 95052 S: USA N: Eric S. 
Raymond E: esr@thyrsus.com W: http://www.tuxedo.org/~esr/ D: terminfo master file maintainer D: Editor: Installation HOWTO, Distributions HOWTO, XFree86 HOWTO D: Author: fetchmail, Emacs VC mode, Emacs GUD mode S: 6 Karen Drive S: Malvern, Pennsylvania 19355 S: USA N: Stefan Reinauer E: stepan@linux.de W: http://www.freiburg.linux.de/~stepan/ D: Modularization of some filesystems D: /proc/sound, minor fixes S: Schlossbergring 9 S: 79098 Freiburg S: Germany N: Thomas Renninger E: trenn@suse.de D: cpupowerutils S: SUSE Linux GmbH S: Germany N: Joerg Reuter E: jreuter@yaina.de W: http://yaina.de/jreuter/ W: http://www.qsl.net/dl1bke/ D: Generic Z8530 driver, AX.25 DAMA slave implementation D: Several AX.25 hacks N: Francois-Rene Rideau E: fare@tunes.org W: http://www.tunes.org/~fare D: petty kernel janitor (byteorder, ufs) S: 6, rue Augustin Thierry S: 75019 Paris S: France N: Rik van Riel E: riel@redhat.com W: http://www.surriel.com/ D: Linux-MM site, Documentation/sysctl/*, swap/mm readaround D: kswapd fixes, random kernel hacker, rmap VM, D: nl.linux.org administrator, minor scheduler additions S: Red Hat Boston S: 3 Lan Drive S: Westford, MA 01886 S: USA N: Pekka Riikonen E: priikone@poseidon.pspt.fi E: priikone@ssh.com D: Random kernel hacking and bug fixes D: International kernel patch project S: Kasarmikatu 11 A4 S: 70110 Kuopio S: Finland N: Tobias Ringstrom E: tori@unhappy.mine.nu D: Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver N: Luca Risolia E: luca.risolia@studio.unibo.it P: 1024D/FCE635A4 88E8 F32F 7244 68BA 3958 5D40 99DA 5D2A FCE6 35A4 D: V4L driver for W996[87]CF JPEG USB Dual Mode Camera Chips D: V4L2 driver for SN9C10x PC Camera Controllers D: V4L2 driver for ET61X151 and ET61X251 PC Camera Controllers D: V4L2 driver for ZC0301 Image Processor and Control Chip S: Via Liberta' 41/A S: Osio Sotto, 24046, Bergamo S: Italy N: William E. 
Roadcap E: roadcapw@cfw.com W: http://www.cfw.com/~roadcapw D: Author of menu based configuration tool, Menuconfig. S: 1407 Broad Street S: Waynesboro, Virginia 22980 S: USA N: Andrew J. Robinson E: arobinso@nyx.net W: http://www.nyx.net/~arobinso D: Hayes ESP serial port driver N: Florian La Roche E: rzsfl@rz.uni-sb.de E: flla@stud.uni-sb.de D: Net programs and kernel net hacker S: Gaildorfer Str. 27 S: 7000 Stuttgart 50 S: Germany N: Christoph Rohland E: hans-christoph.rohland@sap.com E: ch.rohland@gmx.net D: shm fs, SYSV semaphores, af_unix S: Neue Heimat Str. 8 S: D-68789 St.Leon-Rot S: Germany N: Thiago Berlitz Rondon E: maluco@mileniumnet.com.br W: http://vivaldi.linuxms.com.br/~maluco D: Miscellaneous kernel hacker S: R. Anhanguera, 1487 - Ipiranga S: 79080-740 - Campo Grande - Mato Grosso do Sul S: Brazil N: Stephen Rothwell E: sfr@canb.auug.org.au W: http://www.canb.auug.org.au/~sfr P: 1024/BD8C7805 CD A4 9D 01 10 6E 7E 3B 91 88 FA D9 C8 40 AA 02 D: Boot/setup/build work for setup > 2K D: Author, APM driver D: Directory notification S: 66 Maltby Circuit S: Wanniassa ACT 2903 S: Australia N: Gerard Roudier E: groudier@free.fr D: Contributed to asynchronous read-ahead improvement S: 21 Rue Carnot S: 95170 Deuil La Barre S: France N: Sebastien Rougeaux E: Sebastien.Rougeaux@syseng.anu.edu.au D: IEEE 1394 OHCI module S: Research School of Information Science and Engineering S: The Australian National University, ACT 0200 S: Australia N: Aristeu Sergio Rozanski Filho E: aris@cathedrallabs.org D: Support for EtherExpress 10 ISA (i82595) in eepro driver D: User level driver support for input S: R. 
Jose Serrato, 130 - Santa Candida S: 82640-320 - Curitiba - Parana S: Brazil N: Alessandro Rubini E: rubini@ipvvis.unipv.it D: the gpm mouse server and kernel support for it N: Philipp Rumpf E: prumpf@tux.org D: random bugfixes S: Drausnickstrasse 29 S: 91052 Erlangen S: Germany N: Paul `Rusty' Russell E: rusty@rustcorp.com.au W: http://ozlabs.org/~rusty D: Ruggedly handsome. D: netfilter, ipchains with Michael Neuling. S: 52 Moore St S: Turner ACT 2612 S: Australia N: Richard Russon (FlatCap) E: kernel@flatcap.org W: http://www.flatcap.org D: NTFS support D: LDM support (Win2000/XP Logical Disk Manager/Dynamic Disks) S: 50 Swansea Road S: Reading S: United Kingdom N: Bill Ryder E: bryder@sgi.com D: FTDI_SIO usb/serial converter driver W: http://reality.sgi.com/bryder_wellington/ftdi_sio S: I/3 Walter St S: Wellington S: New Zealand N: Sampo Saaristo E: sambo@cs.tut.fi D: Co-author of Multi-Protocol Over ATM (MPOA) S: Tampere University of Technology / Telecom lab S: Hermiankatu 12C S: FIN-33720 Tampere S: Finland N: Thomas Sailer E: t.sailer@alumni.ethz.ch E: HB9JNX@HB9W.CHE.EU (packet radio) D: Baycom driver S: Markusstrasse 18 S: 8006 Zuerich S: Switzerland N: Manuel Estrada Sainz D: Firmware loader (request_firmware) N: Wayne Salamon E: wsalamon@tislabs.com E: wsalamon@nai.com D: portions of the Linux Security Module (LSM) framework and security modules N: Robert Sanders E: gt8134b@prism.gatech.edu D: Dosemu N: Duncan Sands E: duncan.sands@free.fr W: http://topo.math.u-psud.fr/~sands D: Alcatel SpeedTouch USB driver S: 69 rue Dunois S: 75013 Paris S: France N: Dipankar Sarma E: dipankar@in.ibm.com D: RCU N: Hannu Savolainen E: hannu@opensound.com D: Maintainer of the sound drivers until 2.1.x days. D: Original compressed boot image support. S: Valurink. 4A11 S: 03600 Karkkila S: Finland N: Deepak Saxena E: dsaxena@plexity.net D: I2O kernel layer (config, block, core, pci, net). 
I2O disk support for LILO D: XScale(IOP, IXP) porting and other random ARM bits S: Portland, OR N: Eric Schenk E: Eric.Schenk@dna.lth.se D: Random kernel debugging. D: SYSV Semaphore code rewrite. D: Network layer debugging. D: Dial on demand facility (diald). S: Dag Hammerskjolds v. 3E S: S-226 64 LUND S: Sweden N: Henning P. Schmiedehausen E: hps@tanstaafl.de D: added PCI support to the serial driver S: Buckenhof, Germany N: Michael Schmitz E: D: Macintosh IDE Driver N: Peter De Schrijver E: stud11@cc4.kuleuven.ac.be D: Mitsumi CD-ROM driver patches March version S: Molenbaan 29 S: B2240 Zandhoven S: Belgium N: Martin Schulze E: joey@linux.de W: http://home.pages.de/~joey/ D: Random Linux Hacker, Linux Promoter D: CD-List, Books-List, Ex-FAQ D: Linux-Support, -Mailbox, -Stammtisch D: several improvements to system programs S: Oldenburg S: Germany N: Robert Schwebel E: robert@schwebel.de W: http://www.schwebel.de D: Embedded hacker and book author, D: AMD Elan support for Linux S: Pengutronix S: Braunschweiger Strasse 79 S: 31134 Hildesheim S: Germany N: Darren Senn E: sinster@darkwater.com D: Whatever I notice needs doing (so far: itimers, /proc) S: Post Office Box 64132 S: Sunnyvale, California 94088-4132 S: USA N: Stas Sergeev E: stsp@users.sourceforge.net D: PCM PC-Speaker driver D: misc fixes S: Russia N: Simon Shapiro E: shimon@i-Connect.Net W: http://www.-i-Connect.Net/~shimon D: SCSI debugging D: Maintainer of the Debian Kernel packages S: 14355 SW Allen Blvd., Suite #140 S: Beaverton, Oregon 97008 S: USA N: Mike Shaver E: shaver@hungry.org W: http://www.hungry.org/~shaver/ D: MIPS work, /proc/sys/net, misc net hacking S: 149 Union St. 
S: Kingston, Ontario S: Canada K7L 2P4 N: John Shifflett E: john@geolog.com E: jshiffle@netcom.com D: Always IN2000 SCSI driver D: wd33c93 SCSI driver (linux-m68k) S: San Jose, California S: USA N: Robert Siemer E: Robert.Siemer@gmx.de P: 2048/C99A4289 2F DC 17 2E 56 62 01 C8 3D F2 AC 09 F2 E5 DD EE D: miroSOUND PCM20 radio RDS driver, ACI rewrite S: Klosterweg 28 / i309 S: 76131 Karlsruhe S: Germany N: James Simmons E: jsimmons@infradead.org E: jsimmons@users.sf.net D: Frame buffer device maintainer D: input layer development D: tty/console layer D: various mipsel devices S: 115 Carmel Avenue S: El Cerrito CA 94530 S: USA N: Jaspreet Singh E: jaspreet@sangoma.com W: www.sangoma.com D: WANPIPE drivers & API Support for Sangoma S508/FT1 cards S: Sangoma Technologies Inc., S: 1001 Denison Street S: Suite 101 S: Markham, Ontario L3R 2Z6 S: Canada N: Rick Sladkey E: jrs@world.std.com D: utility hacker: Emacs, NFS server, mount, kmem-ps, UPS debugger, strace, GDB D: library hacker: RPC, profil(3), realpath(3), regexp.h D: kernel hacker: unnamed block devs, NFS client, fast select, precision timer S: 24 Avon Place S: Arlington, Massachusetts 02174 S: USA N: Craig Small E: csmall@triode.apana.org.au E: vk2xlz@gonzo.vk2xlz.ampr.org (packet radio) D: Gracilis PackeTwin device driver D: RSPF daemon S: 10 Stockalls Place S: Minto, NSW, 2566 S: Australia N: Stephen Smalley E: sds@tycho.nsa.gov D: portions of the Linux Security Module (LSM) framework and security modules N: Chris Smith E: csmith@convex.com D: Read only HPFS filesystem S: Richardson, Texas S: USA N: Christopher Smith E: x@xman.org D: Tulip net driver hacker N: Mark Smith E: mark.smith@comdev.cc D: Multicast support in bonding driver N: Miquel van Smoorenburg E: miquels@cistron.nl D: Kernel and net hacker. Sysvinit, minicom. doing Debian stuff. 
S: Cistron Internet Services S: PO-Box 297 S: 2400 AG, Alphen aan den Rijn S: The Netherlands N: Scott Snyder E: snyder@fnald0.fnal.gov D: ATAPI cdrom driver S: MS 352, Fermilab S: Post Office Box 500 S: Batavia, Illinois 60510 S: USA N: Leo Spiekman E: leo@netlabs.net W: http://www.netlabs.net/hp/leo/ D: Optics Storage 8000AT cdrom driver S: Cliffwood, New Jersey 07721 S: USA N: Manfred Spraul E: manfred@colorfullife.com W: http://www.colorfullife.com/~manfred D: Lots of tiny hacks. Larger improvements to SysV IPC msg, D: slab, pipe, select. S: 71701 Schwieberdingen S: Germany N: Andrew Stanley-Jones E: asj@lanmedia.com D: LanMedia Corp. Device WAN card device driver S: #102, 686 W. Maude Ave S: Sunnyvale, CA 94086 S: USA N: Michael Still E: mikal@stillhq.com W: http://www.stillhq.com D: Various janitorial patches D: mandocs and mandocs_install build targets S: (Email me and ask) S: Australia N: Henrik Storner E: storner@image.dk W: http://www.image.dk/~storner/ W: http://www.sslug.dk/ D: Configure script: Invented tristate for module-configuration D: vfat/msdos integration, kerneld docs, Linux promotion D: Miscellaneous bug-fixes S: Chr. Winthersvej 1 B, st.th. S: DK-1860 Frederiksberg C S: Denmark N: Drew Sullivan E: drew@ss.org W: http://www.ss.org/ P: 1024/ACFFA969 5A 9C 42 AB E4 24 82 31 99 56 00 BF D3 2B 25 46 D: iBCS2 developer S: 22 Irvington Cres.
S: Willowdale, Ontario S: Canada M2N 2Z1 N: Adam Sulmicki E: adam@cfar.umd.edu W: http://www.eax.com D: core networking fixes D: patch-kernel enhancements D: misc kernel fixes and updates N: Adrian Sun E: asun@cobaltnet.com D: hfs support D: alpha rtc port, random appletalk fixes S: Department of Zoology, University of Washington S: Seattle, WA 98195-1800 S: USA N: Eugene Surovegin E: ebs@ebshome.net W: http://kernel.ebshome.net/ P: 1024D/AE5467F1 FF22 39F1 6728 89F6 6E6C 2365 7602 F33D AE54 67F1 D: Embedded PowerPC 4xx: EMAC, I2C, PIC and random hacks/fixes S: Sunnyvale, California 94085 S: USA N: Corey Thomas E: corey@world.std.com W: http://world.std.com/~corey/index.html D: Raylink/WebGear wireless LAN device driver (ray_cs) author S: 145 Howard St. S: Northborough, MA 01532 S: USA N: Tommy Thorn E: Tommy.Thorn@irisa.fr W: http://www.irisa.fr/prive/thorn/index.html P: 512/B4AFC909 BC BF 6D B1 52 26 1E D6 E3 2F A3 24 2A 84 FE 21 D: Device driver hacker (aha1542 & plip) S: IRISA S: Université de Rennes I S: F-35042 Rennes Cedex S: France N: Urs Thuermann E: urs.thuermann@volkswagen.de W: http://www.volkswagen.de D: Controller Area Network (network layer core) S: Brieffach 1776 S: 38436 Wolfsburg S: Germany N: Jon Tombs E: jon@gte.esi.us.es W: http://www.esi.us.es/~jon D: NFS mmap() D: XF86_S3 D: Kernel modules D: Parts of various other programs (xfig, open, ...) S: C/ Federico Garcia Lorca 1 10-A S: Sevilla 41005 S: Spain N: Linus Torvalds E: torvalds@linux-foundation.org D: Original kernel hacker S: Portland, Oregon 97005 S: USA N: Marcelo Tosatti E: marcelo@kvack.org D: v2.4 kernel maintainer S: Brazil N: Stefan Traby E: stefan@quant-x.com D: Minor Alpha kernel hacks S: Mitterlasznitzstr.
13 S: 8302 Nestelbach S: Austria N: Jeff Tranter E: tranter@pobox.com D: Enhancements to Joystick driver D: Author of Sound HOWTO and CD-ROM HOWTO D: Author of several small utilities D: (bogomips, scope, eject, statserial) S: 1 Laurie Court S: Kanata, Ontario S: Canada K2L 1S2 N: Andrew Tridgell E: tridge@samba.org W: http://samba.org/tridge/ D: dosemu, networking, samba S: 3 Ballow Crescent S: MacGregor A.C.T 2615 S: Australia N: Josh Triplett E: josh@freedesktop.org P: 1024D/D0FE7AFB B24A 65C9 1D71 2AC2 DE87 CA26 189B 9946 D0FE 7AFB D: rcutorture maintainer D: lock annotations, finding and fixing lock bugs N: Winfried Trumper E: winni@xpilot.org W: http://www.shop.de/~winni/ D: German HOWTO, Crash-Kurs Linux (German, 100 comprehensive pages) D: CD-Writing HOWTO, various mini-HOWTOs D: One-week tutorials on Linux twice a year (free of charge) D: Linux-Workshop Koln (aka LUG Cologne, Germany), Installfests S: Tacitusstr. 6 S: D-50968 Koln N: Tsu-Sheng Tsao E: tsusheng@scf.usc.edu D: IGMP(Internet Group Management Protocol) version 2 S: 2F 14 ALY 31 LN 166 SEC 1 SHIH-PEI RD S: Taipei S: Taiwan 112 S: Republic of China S: 24335 Delta Drive S: Diamond Bar, California 91765 S: USA N: Theodore Ts'o E: tytso@mit.edu D: Random Linux hacker D: Maintainer of tsx-11.mit.edu ftp archive D: Maintainer of c.o.l.* Usenet<->mail gateway D: Author of serial driver D: Author of the new e2fsck D: Author of job control and system call restart code D: Author of ramdisk device driver D: Author of loopback device driver D: Author of /dev/random driver S: MIT Room E40-343 S: 1 Amherst Street S: Cambridge, Massachusetts 02139 S: USA N: Simmule Turner E: sturner@tele-tv.com D: Added swapping to filesystem S: 4226 Landgreen Street S: Rockville, Maryland 20853 S: USA N: Stephen Tweedie E: sct@redhat.com P: 1024/E7A417AD E2 FE A4 20 34 EC ED FC 7D 7E 67 8D E0 31 D1 69 P: 1024D/43BE7544 D2A4 8556 08E6 90E7 076C BA3F 243F 20A4 43BE 7544 D: Second extended file system developer D: General 
filesystem hacker D: kswap vm management code S: 44 Campbell Park Crescent S: Edinburgh EH13 0HT S: United Kingdom N: Thomas Uhl E: uhl@sun1.rz.fh-heilbronn.de D: Application programmer D: Linux promoter D: Author of a German book on Linux S: Obere Heerbergstrasse 17 S: 97078 Wuerzburg S: Germany N: Greg Ungerer E: gerg@snapgear.com D: uClinux kernel hacker D: Port uClinux to the Motorola ColdFire CPU D: Author of Stallion multiport serial drivers S: SnapGear Inc. S: 825 Stanley St S: Woolloongabba. QLD. 4102 S: Australia N: Jeffrey A. Uphoff E: juphoff@transmeta.com E: jeff.uphoff@linux.org P: 1024/9ED505C5 D7 BB CA AA 10 45 40 1B 16 19 0A C0 38 A0 3E CB D: Linux Security/Alert mailing lists' moderator/maintainer. D: NSM (rpc.statd) developer. D: PAM S/Key module developer. D: 'dip' contributor. D: AIPS port, astronomical community support. S: Transmeta Corporation S: 2540 Mission College Blvd. S: Santa Clara, CA 95054 S: USA N: Matthias Urlichs E: smurf@smurf.noris.de E: smurf@debian.org E: matthias@urlichs.de D: Consultant, developer, kernel hacker D: In a previous life, worked on Streams/ISDN/BSD networking code for Linux S: Schleiermacherstrasse 12 S: 90491 Nuernberg S: Germany N: Geert Uytterhoeven E: geert@linux-m68k.org W: http://users.telenet.be/geertu/ P: 1024/862678A6 C51D 361C 0BD1 4C90 B275 C553 6EEA 11BA 8626 78A6 D: m68k/Amiga and PPC/CHRP Longtrail coordinator D: Frame buffer device and XF68_FBDev maintainer D: m68k IDE maintainer D: Amiga Zorro maintainer D: Amiga Buddha and Catweasel chipset IDE D: Atari Falcon chipset IDE D: Amiga Gayle chipset IDE D: mipsel NEC DDB Vrc-5074 S: Haterbeekstraat 55B S: B-3200 Aarschot S: Belgium N: Chris Vance E: cvance@tislabs.com E: cvance@nai.com D: portions of the Linux Security Module (LSM) framework and security modules N: Petr Vandrovec E: petr@vandrovec.name D: Small contributions to ncpfs D: Matrox framebuffer driver S: 21513 Conradia Ct S: Cupertino, CA 95014 S: USA N: Thibaut Varene E: 
T-Bone@parisc-linux.org W: http://www.parisc-linux.org/~varenet/ P: 1024D/B7D2F063 E67C 0D43 A75E 12A5 BB1C FA2F 1E32 C3DA B7D2 F063 D: PA-RISC port minion, PDC and GSCPS2 drivers, debuglocks and other bits D: Some ARM at91rm9200 bits, S1D13XXX FB driver, random patches here and there D: AD1889 sound driver S: Paris, France N: Heikki Vatiainen E: hessu@cs.tut.fi D: Co-author of Multi-Protocol Over ATM (MPOA), some LANE hacks S: Tampere University of Technology / Telecom lab S: Hermiankatu 12C S: FIN-33720 Tampere S: Finland N: Andrew Veliath E: andrewtv@usa.net D: Turtle Beach MultiSound sound driver S: USA N: Dirk Verworner D: Co-author of German book ``Linux-Kernel-Programmierung'' D: Co-founder of Berlin Linux User Group N: Riku Voipio E: riku.voipio@iki.fi D: Author of PCA9532 LED and Fintek f75375s hwmon driver D: Some random ARM board patches S: Finland N: Patrick Volkerding E: volkerdi@ftp.cdrom.com D: Produced the Slackware distribution, updated the SVGAlib D: patches for ghostscript, worked on color 'ls', etc. S: 301 15th Street S. S: Moorhead, Minnesota 56560 S: USA N: Jos Vos E: jos@xos.nl W: http://www.xos.nl/ D: Various IP firewall updates, ipfwadm S: X/OS Experts in Open Systems BV S: Kruislaan 419 S: 1098 VA Amsterdam S: The Netherlands N: Jeroen Vreeken E: pe1rxq@amsat.org W: http://www.chello.nl/~j.vreeken/ D: SE401 usb webcam driver D: ZD1201 usb wireless lan driver S: Maastrichterweg 63 S: 5554 GG Valkenswaard S: The Netherlands N: Mark Wallis E: mwallis@serialmonkey.com W: http://mark.serialmonkey.com D: Ralink rt2x00 WLAN driver S: Newcastle, Australia N: Peter Shaobo Wang E: pwang@mmdcorp.com W: http://www.mmdcorp.com/pw/linux D: Driver for Interphase ATM (i)Chip SAR adapter card family (x575, x525, x531). S: 1513 Brewster Dr. 
S: Carrollton, TX 75010 S: USA N: Tim Waugh E: tim@cyberelk.net D: Co-architect of the parallel-port sharing system S: 17 Curling Vale S: GUILDFORD S: Surrey S: GU2 7PJ S: United Kingdom N: Juergen Weigert E: jnweiger@immd4.informatik.uni-erlangen.de D: The Linux Support Team Erlangen N: David Weinehall E: tao@acc.umu.se P: 1024D/DC47CA16 7ACE 0FB0 7A74 F994 9B36 E1D1 D14E 8526 DC47 CA16 W: http://www.acc.umu.se/~tao/ D: v2.0 kernel maintainer D: Fixes for the NE/2-driver D: Miscellaneous MCA-support D: Cleanup of the Config-files N: Matt Welsh E: mdw@metalab.unc.edu W: http://www.cs.berkeley.edu/~mdw D: Original Linux Documentation Project coordinator D: Author, "Running Linux" (O'Reilly) D: Author, "Linux Installation and Getting Started" (LDP) and several HOWTOs D: Linuxdoc-SGML formatting system (now SGML-Tools) D: Device drivers for various high-speed network interfaces (Myrinet, ATM) D: Keithley DAS1200 device driver D: Original maintainer of sunsite WWW and FTP sites D: Original moderator of c.o.l.announce and c.o.l.answers S: Computer Science Division S: UC Berkeley S: Berkeley, CA 94720-1776 S: USA N: Harald Welte E: laforge@netfilter.org P: 1024D/30F48BFF DBDE 6912 8831 9A53 879B 9190 5DA5 C655 30F4 8BFF W: http://gnumonks.org/users/laforge D: netfilter: new nat helper infrastructure D: netfilter: ULOG, ECN, DSCP target D: netfilter: TTL match D: netfilter: IPv6 mangle table D: netfilter: various other hacks S: Berlin S: Germany N: Bill Wendling E: wendling@ganymede.isdn.uiuc.edu W: http://www.ncsa.uiuc.edu/~wendling/ D: Various random hacks. Mostly on poll/select logic. S: 605 E. Springfield Ave. S: Champaign, IL 61820 S: USA N: Mike Westall D: IBM Turboways 25 ATM Device Driver E: westall@cs.clemson.edu S: Department of Computer Science S: Clemson University S: Clemson SC 29634 USA N: Greg Wettstein E: greg@wind.rmcc.com D: Filesystem valid flag for MINIX filesystem. D: Minor kernel debugging. D: Development and maintenance of sysklogd. 
D: Monitoring of development kernels for long-term stability. D: Early implementations of Linux in a commercial environment. S: Dr. Greg Wettstein, Ph.D. S: Oncology Research Division Computing Facility S: Roger Maris Cancer Center S: 820 4th St. N. S: Fargo, North Dakota 58122 S: USA N: Steven Whitehouse E: steve@chygwyn.com W: http://www.chygwyn.com/~steve D: Linux DECnet project D: Minor debugging of other networking protocols. D: Misc bug fixes and GFS2 filesystem development N: Hans-Joachim Widmaier E: hjw@zvw.de D: AFFS rewrite S: Eichenweg 16 S: 73650 Winterbach S: Germany N: Urban Widmark E: urban@svenskatest.se D: via-rhine, misc net driver hacking N: Marco van Wieringen E: mvw@planets.elm.net D: Author of process accounting and diskquota S: Breeburgsingel 12 S: 2135 CN Hoofddorp S: The Netherlands N: Matthew Wilcox E: matthew@wil.cx W: ftp://ftp.uk.linux.org/pub/linux/people/willy/ D: Linux/PARISC hacker. Filesystem hacker. Random other hacking. Custom D: PPC port hacking. N: G\"unter Windau E: gunter@mbfys.kun.nl D: Some bug fixes in the polling printer driver (lp.c) S: University of Nijmegen S: Geert-Grooteplein Noord 21 S: 6525 EZ Nijmegen S: The Netherlands N: Ulrich Windl E: Ulrich.Windl@rz.uni-regensburg.de P: 1024/E843660D CF D7 43 A1 5A 49 14 25 7C 04 A0 6E 4C 3A AC 6D D: Supports NTP on Linux. Added PPS code. Fixed bugs in adjtimex(). S: Alte Regensburger Str. 
11a S: 93149 Nittenau S: Germany N: Gertjan van Wingerde E: gwingerde@gmail.com D: Ralink rt2x00 WLAN driver D: Minix V2 file-system D: Misc fixes S: Geessinkweg 177 S: 7544 TX Enschede S: The Netherlands N: Lars Wirzenius E: liw@iki.fi D: Linux System Administrator's Guide, author, former maintainer D: comp.os.linux.announce, former moderator D: Linux Documentation Project, co-founder D: Original sprintf in kernel D: Original kernel README (for version 0.97) D: Linux News (electronic magazine, now dead), founder and former editor D: Meta-FAQ, originator, former maintainer D: INFO-SHEET, former maintainer D: Author of the longest-living linux bug N: Jonathan Woithe E: jwoithe@physics.adelaide.edu.au W: http://www.physics.adelaide.edu.au/~jwoithe D: ALS-007 sound card extensions to Sound Blaster driver S: 20 Jordan St S: Valley View, SA 5093 S: Australia N: Clifford Wolf E: god@clifford.at W: http://www.clifford.at/ D: Menuconfig/lxdialog improvement S: Foehrengasse 16 S: A-2333 Leopoldsdorf b. Wien S: Austria N: Roger E. Wolff E: R.E.Wolff@BitWizard.nl D: Written kmalloc/kfree D: Written Specialix IO8+ driver D: Written Specialix SX driver S: van Bronckhorststraat 12 S: 2612 XV Delft S: The Netherlands N: Thomas Woller D: CS461x Cirrus Logic sound driver N: David Woodhouse E: dwmw2@infradead.org D: JFFS2 file system, Memory Technology Device subsystem, D: various other stuff that annoyed me by not working. S: c/o Intel Corporation S: Pipers Way S: Swindon. SN3 1RJ S: England N: Chris Wright E: chrisw@sous-sol.org D: hacking on LSM framework and security modules. S: Portland, OR S: USA N: Michal Wronski E: michal.wronski@gmail.com D: POSIX message queues fs (with K. 
Benedyczak)
S: Krakow
S: Poland

N: Frank Xia
E: qx@math.columbia.edu
D: Xiafs filesystem [defunct]
S: 542 West 112th Street, 5N
S: New York, New York 10025
S: USA

N: Li Yang
E: leoli@freescale.com
D: Freescale Highspeed USB device driver
D: Freescale QE SoC support and Ethernet driver
S: B-1206 Jingmao Guojigongyu
S: 16 Baliqiao Nanjie, Beijing 101100
S: People's Republic of China

N: Victor Yodaiken
E: yodaiken@fsmlabs.com
D: RTLinux (RealTime Linux)
S: POB 1822
S: Socorro NM, 87801
S: USA

N: Hiroshi YOKOTA
E: yokota@netlab.is.tsukuba.ac.jp
D: Workbit NinjaSCSI-3/32Bi PCMCIA driver
D: Workbit NinjaSCSI-32Bi/UDE driver
S: Japan

N: Hideaki YOSHIFUJI
E: hideaki@yoshifuji.org
E: yoshfuji@linux-ipv6.org
W: http://www.yoshifuji.org/~hideaki/
P: 1024D/E0620EEA 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA
D: IPv6 and other networking related stuff
D: USAGI/WIDE Project, Keio University
S: Jeunet Palace Kawasaki #1-201, 10-2, Furukawa-cho, Saiwai-ku
S: Kawasaki, Kanagawa 212-0025
S: Japan

N: Eric Youngdale
E: eric@andante.org
W: http://www.andante.org
D: General kernel hacker
D: SCSI iso9660 and ELF
S: 6389 Hawk View Lane
S: Alexandria, Virginia 22312
S: USA

N: Niibe Yutaka
E: gniibe@mri.co.jp
D: PLIP driver
D: Asynchronous socket I/O in the NET code
S: Mitsubishi Research Institute, Inc.
S: ARCO Tower 1-8-1 Shimomeguro Meguro-ku
S: Tokyo 153
S: Japan

N: James R.
Van Zandt E: jrv@vanzandt.mv.com P: 1024/E298966D F0 37 4F FD E5 7E C5 E6 F1 A0 1E 22 6F 46 DA 0C D: Author and maintainer of the Double Talk speech synthesizer driver S: 27 Spencer Drive S: Nashua, New Hampshire 03062 S: USA N: Orest Zborowski E: orestz@eskimo.com D: XFree86 and kernel development S: 1507 145th Place SE #B5 S: Bellevue, Washington 98007 S: USA N: Richard Zidlicky E: rz@linux-m68k.org, rdzidlic@geocities.com W: http://www.geocities.com/rdzidlic D: Q40 port - see arch/m68k/q40/README D: various m68k hacks S: Germany N: Werner Zimmermann E: Werner.Zimmermann@fht-esslingen.de D: CDROM driver "aztcd" (Aztech/Okano/Orchid/Wearnes) S: Flandernstrasse 101 S: D-73732 Esslingen S: Germany N: Roman Zippel E: zippel@linux-m68k.org D: AFFS and HFS filesystems, m68k maintainer, new kernel configuration in 2.5 N: Leonard N. Zubkoff W: http://www.dandelion.com/Linux/ D: BusLogic SCSI driver D: Mylex DAC960 PCI RAID driver D: Miscellaneous kernel fixes N: Alessandro Zummo E: a.zummo@towertech.it D: CMI8330 support is sb_card.c D: ISAPnP fixes in sb_card.c D: ZyXEL omni.net lcd plus driver D: RTC subsystem S: Italy N: Marc Zyngier E: maz@wild-wind.fr.eu.org W: http://www.misterjones.org D: MD driver D: EISA/sysfs subsystem S: France # Don't add your name here, unless you really _are_ after Marc # alphabetically. Leonard used to be very proud of being the # last entry, and he'll get positively pissed if he can't even # be second-to-last. (and this file really _is_ supposed to be # in alphabetic order) Linux kernel release 3.x These are the release notes for Linux version 3. Read them carefully, as they tell you what this is all about, explain how to install the kernel, and what to do if something goes wrong. WHAT IS LINUX? Linux is a clone of the operating system Unix, written from scratch by Linus Torvalds with assistance from a loosely-knit team of hackers across the Net. It aims towards POSIX and Single UNIX Specification compliance. 
  It has all the features you would expect in a modern fully-fledged Unix, including true multitasking, virtual memory, shared libraries, demand loading, shared copy-on-write executables, proper memory management, and multistack networking including IPv4 and IPv6.

  It is distributed under the GNU General Public License - see the accompanying COPYING file for more details.

ON WHAT HARDWARE DOES IT RUN?

  Although originally developed for 32-bit x86-based PCs (386 or higher), today Linux also runs on (at least) the Compaq Alpha AXP, Sun SPARC and UltraSPARC, Motorola 68000, PowerPC, PowerPC64, ARM, Hitachi SuperH, Cell, IBM S/390, MIPS, HP PA-RISC, Intel IA-64, DEC VAX, AMD x86-64, AXIS CRIS, Xtensa, Tilera TILE, AVR32 and Renesas M32R architectures.

  Linux is easily portable to most general-purpose 32- or 64-bit architectures as long as they have a paged memory management unit (PMMU) and a port of the GNU C compiler (gcc) (part of The GNU Compiler Collection, GCC). Linux has also been ported to a number of architectures without a PMMU, although functionality is then obviously somewhat limited. Linux has also been ported to itself. You can now run the kernel as a userspace application - this is called UserMode Linux (UML).

DOCUMENTATION:

 - There is a lot of documentation available both in electronic form on the Internet and in books, both Linux-specific and pertaining to general UNIX questions. I'd recommend looking into the documentation subdirectories on any Linux FTP site for the LDP (Linux Documentation Project) books. This README is not meant to be documentation on the system: there are much better sources available.

 - There are various README files in the Documentation/ subdirectory: these typically contain, for example, kernel-specific installation notes for some drivers. See Documentation/00-INDEX for a list of what is contained in each file. Please read the Changes file, as it contains information about the problems that may result from upgrading your kernel.
 - The Documentation/DocBook/ subdirectory contains several guides for kernel developers and users. These guides can be rendered in a number of formats: PostScript (.ps), PDF, HTML, & man-pages, among others. After installation, "make psdocs", "make pdfdocs", "make htmldocs", or "make mandocs" will render the documentation in the requested format.

INSTALLING the kernel source:

 - If you install the full sources, put the kernel tarball in a directory where you have permissions (e.g. your home directory) and unpack it:

     gzip -cd linux-3.X.tar.gz | tar xvf -

   or

     bzip2 -dc linux-3.X.tar.bz2 | tar xvf -

   Replace "X" with the version number of the latest kernel.

   Do NOT use the /usr/src/linux area! This area has a (usually incomplete) set of kernel headers that are used by the library header files. They should match the library, and not get messed up by whatever the kernel-du-jour happens to be.

 - You can also upgrade between 3.x releases by patching. Patches are distributed in the traditional gzip and the newer bzip2 format. To install by patching, get all the newer patch files, enter the top level directory of the kernel source (linux-3.x) and execute:

     gzip -cd ../patch-3.x.gz | patch -p1

   or

     bzip2 -dc ../patch-3.x.bz2 | patch -p1

   (repeat this for every version newer than that of your current source tree, _in_order_) and you should be ok. You may want to remove the backup files (xxx~ or xxx.orig), and make sure that there are no failed patches (xxx# or xxx.rej). If there are, either you or I have made a mistake.

   Unlike patches for the 3.x kernels, patches for the 3.x.y kernels (also known as the -stable kernels) are not incremental but instead apply directly to the base 3.x kernel. Please read Documentation/applying-patches.txt for more information.

   Alternatively, the script patch-kernel can be used to automate this process. It determines the current kernel version and applies any patches found.
     linux/scripts/patch-kernel linux

   The first argument in the command above is the location of the kernel source. Patches are applied from the current directory, but an alternative directory can be specified as the second argument.

 - If you are upgrading between releases using the stable series patches (for example, patch-3.x.y), note that these "dot-releases" are not incremental and must be applied to the 3.x base tree. For example, if your base kernel is 3.0 and you want to apply the 3.0.3 patch, you do not and indeed must not first apply the 3.0.1 and 3.0.2 patches. Similarly, if you are running kernel version 3.0.2 and want to jump to 3.0.3, you must first reverse the 3.0.2 patch (that is, patch -R) _before_ applying the 3.0.3 patch. You can read more on this in Documentation/applying-patches.txt.

 - Make sure you have no stale .o files and dependencies lying around:

     cd linux
     make mrproper

   You should now have the sources correctly installed.

SOFTWARE REQUIREMENTS

   Compiling and running the 3.x kernels requires up-to-date versions of various software packages. Consult Documentation/Changes for the minimum version numbers required and how to get updates for these packages. Beware that using excessively old versions of these packages can cause indirect errors that are very difficult to track down, so don't assume that you can just update packages when obvious problems arise during build or operation.

BUILD directory for the kernel:

   When compiling the kernel, all output files will by default be stored together with the kernel source code. Using the option "make O=output/dir" allows you to specify an alternate place for the output files (including .config).
Example: kernel source code: /usr/src/linux-3.N build directory: /home/name/build/kernel To configure and build the kernel use: cd /usr/src/linux-3.N make O=/home/name/build/kernel menuconfig make O=/home/name/build/kernel sudo make O=/home/name/build/kernel modules_install install Please note: If the 'O=output/dir' option is used then it must be used for all invocations of make. CONFIGURING the kernel: Do not skip this step even if you are only upgrading one minor version. New configuration options are added in each release, and odd problems will turn up if the configuration files are not set up as expected. If you want to carry your existing configuration to a new version with minimal work, use "make oldconfig", which will only ask you for the answers to new questions. - Alternate configuration commands are: "make config" Plain text interface. "make menuconfig" Text based color menus, radiolists & dialogs. "make nconfig" Enhanced text based color menus. "make xconfig" X windows (Qt) based configuration tool. "make gconfig" X windows (Gtk) based configuration tool. "make oldconfig" Default all questions based on the contents of your existing ./.config file and asking about new config symbols. "make silentoldconfig" Like above, but avoids cluttering the screen with questions already answered. Additionally updates the dependencies. "make defconfig" Create a ./.config file by using the default symbol values from either arch/$ARCH/defconfig or arch/$ARCH/configs/${PLATFORM}_defconfig, depending on the architecture. "make ${PLATFORM}_defconfig" Create a ./.config file by using the default symbol values from arch/$ARCH/configs/${PLATFORM}_defconfig. Use "make help" to get a list of all available platforms of your architecture. "make allyesconfig" Create a ./.config file by setting symbol values to 'y' as much as possible. "make allmodconfig" Create a ./.config file by setting symbol values to 'm' as much as possible. 
"make allnoconfig" Create a ./.config file by setting symbol values to 'n' as much as possible. "make randconfig" Create a ./.config file by setting symbol values to random values. You can find more information on using the Linux kernel config tools in Documentation/kbuild/kconfig.txt. NOTES on "make config": - having unnecessary drivers will make the kernel bigger, and can under some circumstances lead to problems: probing for a nonexistent controller card may confuse your other controllers - compiling the kernel with "Processor type" set higher than 386 will result in a kernel that does NOT work on a 386. The kernel will detect this on bootup, and give up. - A kernel with math-emulation compiled in will still use the coprocessor if one is present: the math emulation will just never get used in that case. The kernel will be slightly larger, but will work on different machines regardless of whether they have a math coprocessor or not. - the "kernel hacking" configuration details usually result in a bigger or slower kernel (or both), and can even make the kernel less stable by configuring some routines to actively try to break bad code to find kernel problems (kmalloc()). Thus you should probably answer 'n' to the questions for "development", "experimental", or "debugging" features. COMPILING the kernel: - Make sure you have at least gcc 3.2 available. For more information, refer to Documentation/Changes. Please note that you can still run a.out user programs with this kernel. - Do a "make" to create a compressed kernel image. It is also possible to do "make install" if you have lilo installed to suit the kernel makefiles, but you may want to check your particular lilo setup first. To do the actual install you have to be root, but none of the normal build should require that. Don't take the name of root in vain. - If you configured any of the parts of the kernel as `modules', you will also have to do "make modules_install". 
- Verbose kernel compile/build output: Normally the kernel build system runs in a fairly quiet mode (but not totally silent). However, sometimes you or other kernel developers need to see compile, link, or other commands exactly as they are executed. For this, use "verbose" build mode. This is done by inserting "V=1" in the "make" command. E.g.: make V=1 all To have the build system also tell the reason for the rebuild of each target, use "V=2". The default is "V=0". - Keep a backup kernel handy in case something goes wrong. This is especially true for the development releases, since each new release contains new code which has not been debugged. Make sure you keep a backup of the modules corresponding to that kernel, as well. If you are installing a new kernel with the same version number as your working kernel, make a backup of your modules directory before you do a "make modules_install". Alternatively, before compiling, use the kernel config option "LOCALVERSION" to append a unique suffix to the regular kernel version. LOCALVERSION can be set in the "General Setup" menu. - In order to boot your new kernel, you'll need to copy the kernel image (e.g. .../linux/arch/i386/boot/bzImage after compilation) to the place where your regular bootable kernel is found. - Booting a kernel directly from a floppy without the assistance of a bootloader such as LILO, is no longer supported. If you boot Linux from the hard drive, chances are you use LILO which uses the kernel image as specified in the file /etc/lilo.conf. The kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or /boot/bzImage. To use the new kernel, save a copy of the old image and copy the new image over the old one. Then, you MUST RERUN LILO to update the loading map!! If you don't, you won't be able to boot the new kernel image. Reinstalling LILO is usually a matter of running /sbin/lilo. 
   You may wish to edit /etc/lilo.conf to specify an entry for your old kernel image (say, /vmlinux.old) in case the new one does not work. See the LILO docs for more information.

   After reinstalling LILO, you should be all set. Shut down the system, reboot, and enjoy!

   If you ever need to change the default root device, video mode, ramdisk size, etc. in the kernel image, use the 'rdev' program (or alternatively the LILO boot options when appropriate). No need to recompile the kernel to change these parameters.

 - Reboot with the new kernel and enjoy.

IF SOMETHING GOES WRONG:

 - If you have problems that seem to be due to kernel bugs, please check the file MAINTAINERS to see if there is a particular person associated with the part of the kernel that you are having trouble with. If there isn't anyone listed there, then the second best thing is to mail them to me (torvalds@linux-foundation.org), and possibly to any other relevant mailing-list or to the newsgroup.

 - In all bug-reports, *please* tell what kernel you are talking about, how to duplicate the problem, and what your setup is (use your common sense). If the problem is new, tell me so, and if the problem is old, please try to tell me when you first noticed it.

 - If the bug results in a message like

     unable to handle kernel paging request at address C0000010
     Oops: 0002
     EIP:   0010:XXXXXXXX
     eax: xxxxxxxx   ebx: xxxxxxxx   ecx: xxxxxxxx   edx: xxxxxxxx
     esi: xxxxxxxx   edi: xxxxxxxx   ebp: xxxxxxxx
     ds: xxxx  es: xxxx  fs: xxxx  gs: xxxx
     Pid: xx, process nr: xx
     xx xx xx xx xx xx xx xx xx xx

   or similar kernel debugging information on your screen or in your system log, please duplicate it *exactly*. The dump may look incomprehensible to you, but it does contain information that may help debugging the problem. The text above the dump is also important: it tells something about why the kernel dumped code (in the above example it's due to a bad kernel pointer).
More information on making sense of the dump is in Documentation/oops-tracing.txt - If you compiled the kernel with CONFIG_KALLSYMS you can send the dump as is, otherwise you will have to use the "ksymoops" program to make sense of the dump (but compiling with CONFIG_KALLSYMS is usually preferred). This utility can be downloaded from ftp://ftp..kernel.org/pub/linux/utils/kernel/ksymoops/ . Alternately you can do the dump lookup by hand: - In debugging dumps like the above, it helps enormously if you can look up what the EIP value means. The hex value as such doesn't help me or anybody else very much: it will depend on your particular kernel setup. What you should do is take the hex value from the EIP line (ignore the "0010:"), and look it up in the kernel namelist to see which kernel function contains the offending address. To find out the kernel function name, you'll need to find the system binary associated with the kernel that exhibited the symptom. This is the file 'linux/vmlinux'. To extract the namelist and match it against the EIP from the kernel crash, do: nm vmlinux | sort | less This will give you a list of kernel addresses sorted in ascending order, from which it is simple to find the function that contains the offending address. Note that the address given by the kernel debugging messages will not necessarily match exactly with the function addresses (in fact, that is very unlikely), so you can't just 'grep' the list: the list will, however, give you the starting point of each kernel function, so by looking for the function that has a starting address lower than the one you are searching for but is followed by a function with a higher address you will find the one you want. In fact, it may be a good idea to include a bit of "context" in your problem report, giving a few lines around the interesting one. If you for some reason cannot do the above (you have a pre-compiled kernel image or similar), telling me as much about your setup as possible will help. 
Please read the REPORTING-BUGS document for details. - Alternately, you can use gdb on a running kernel. (read-only; i.e. you cannot change values or set break points.) To do this, first compile the kernel with -g; edit arch/i386/Makefile appropriately, then do a "make clean". You'll also need to enable CONFIG_PROC_FS (via "make config"). After you've rebooted with the new kernel, do "gdb vmlinux /proc/kcore". You can now use all the usual gdb commands. The command to look up the point where your system crashed is "l *0xXXXXXXXX". (Replace the XXXes with the EIP value.) gdb'ing a non-running kernel currently fails because gdb (wrongly) disregards the starting offset for which the kernel is compiled. Applying Patches To The Linux Kernel ------------------------------------ Original by: Jesper Juhl, August 2005 Last update: 2006-01-05 A frequently asked question on the Linux Kernel Mailing List is how to apply a patch to the kernel or, more specifically, what base kernel a patch for one of the many trees/branches should be applied to. Hopefully this document will explain this to you. In addition to explaining how to apply and revert patches, a brief description of the different kernel trees (and examples of how to apply their specific patches) is also provided. What is a patch? --- A patch is a small text document containing a delta of changes between two different versions of a source tree. Patches are created with the `diff' program. To correctly apply a patch you need to know what base it was generated from and what new version the patch will change the source tree into. These should both be present in the patch file metadata or be possible to deduce from the filename. How do I apply or revert a patch? --- You apply a patch with the `patch' program. The patch program reads a diff (or patch) file and makes the changes to the source tree described in it. Patches for the Linux kernel are generated relative to the parent directory holding the kernel source dir. 
This means that paths to files inside the patch file contain the name of the kernel source directories it was generated against (or some other directory names like "a/" and "b/"). Since this is unlikely to match the name of the kernel source dir on your local machine (but is often useful info to see what version an otherwise unlabeled patch was generated against) you should change into your kernel source directory and then strip the first element of the path from filenames in the patch file when applying it (the -p1 argument to `patch' does this).

To revert a previously applied patch, use the -R argument to patch. So, if you applied a patch like this:

	patch -p1 < ../patch-x.y.z

You can revert (undo) it like this:

	patch -R -p1 < ../patch-x.y.z


How do I feed a patch/diff file to `patch'?
---
This (as usual with Linux and other UNIX-like operating systems) can be done in several different ways. In all the examples below I feed the file (in uncompressed form) to patch via stdin using the following syntax:

	patch -p1 < path/to/patch-x.y.z

If you just want to be able to follow the examples below and don't want to know of more than one way to use patch, then you can stop reading this section here.

Patch can also get the name of the file to use via the -i argument, like this:

	patch -p1 -i path/to/patch-x.y.z

If your patch file is compressed with gzip or bzip2 and you don't want to uncompress it before applying it, then you can feed it to patch like this instead:

	zcat path/to/patch-x.y.z.gz | patch -p1
	bzcat path/to/patch-x.y.z.bz2 | patch -p1

If you wish to uncompress the patch file by hand first before applying it (what I assume you've done in the examples below), then you simply run gunzip or bunzip2 on the file -- like this:

	gunzip patch-x.y.z.gz
	bunzip2 patch-x.y.z.bz2

Which will leave you with a plain text patch-x.y.z file that you can feed to patch via stdin or the -i argument, as you prefer.
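To see the pieces above working together without touching a real kernel tree, here is a minimal sketch. It is only an illustration: the directory names (demo-1.0, demo-1.1) and the file patch-1.1 are made up for the example, and it assumes diff, patch and gzip are installed. It generates a patch from the parent directory, applies the compressed copy with -p1, and then reverts it using -R and -i.

```shell
#!/bin/sh
# Toy demonstration of -p1, compressed input, -i and -R.
# All names below (demo-1.0, demo-1.1, patch-1.1) are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"

# Two versions of a tiny "source tree".
mkdir -p demo-1.0/src demo-1.1/src
printf 'line one\nline two\n' > demo-1.0/src/hello.txt
printf 'line one\nline 2\n'   > demo-1.1/src/hello.txt

# Generate the patch from the parent directory, so every path in it starts
# with the tree name. diff exits with status 1 when the trees differ, hence
# the "|| true" to keep "set -e" from aborting.
diff -urN demo-1.0 demo-1.1 > patch-1.1 || true
gzip -c patch-1.1 > patch-1.1.gz

# Apply from inside the tree; -p1 strips the leading "demo-1.0/" component.
# "gzip -dc" does the same job as zcat here.
cd demo-1.0
gzip -dc ../patch-1.1.gz | patch -p1 -s
applied=$(cat src/hello.txt)

# Revert, this time naming the uncompressed file with -i instead of stdin.
patch -R -p1 -s -i ../patch-1.1
reverted=$(cat src/hello.txt)

printf 'after apply:\n%s\nafter revert:\n%s\n' "$applied" "$reverted"
```

Running the script from any directory is safe, since it works entirely inside a fresh temporary directory.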
A few other nice arguments for patch are -s, which causes patch to be silent except for errors (handy for keeping errors from scrolling off the screen too fast), and --dry-run, which causes patch to just print a listing of what would happen without actually making any changes. Finally, --verbose tells patch to print more information about the work being done.


Common errors when patching
---
When patch applies a patch file it attempts to verify the sanity of the file in different ways. Checking that the file looks like a valid patch file and checking that the code around the bits being modified matches the context provided in the patch are just two of the basic sanity checks patch does.

If patch encounters something that doesn't look quite right it has two options. It can either refuse to apply the changes and abort or it can try to find a way to make the patch apply with a few minor changes.

One example of something that's not 'quite right' that patch will attempt to fix up is if all the context matches, the lines being changed match, but the line numbers are different. This can happen, for example, if the patch makes a change in the middle of the file but for some reason a few lines have been added or removed near the beginning of the file. In that case everything looks good; it has just moved up or down a bit, and patch will usually adjust the line numbers and apply the patch.

Whenever patch applies a patch that it had to modify a bit to make it fit, it'll tell you about it by saying the patch applied with 'fuzz'. You should be wary of such changes since even though patch probably got it right it doesn't /always/ get it right, and the result will sometimes be wrong.

When patch encounters a change that it can't fix up with fuzz it rejects it outright and leaves a file with a .rej extension (a reject file). You can read this file to see exactly what change couldn't be applied, so you can go fix it up by hand if you wish.
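The reject behaviour described above can be reproduced on a throwaway tree. The following is a rough sketch, assuming GNU patch; the tree, the file names and the deliberately wrong patch (bad.patch) are invented for the illustration. It first checks the patch with --dry-run, then applies it for real and lists the reject files left behind.

```shell
#!/bin/sh
# Provoke a rejected hunk on purpose (assumes GNU patch).
# The tree layout and bad.patch are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"
mkdir -p tree/src
printf 'alpha\nbeta\ngamma\n' > tree/src/file.txt

# The line to be removed ("BETA") does not exist in the file, and fuzz only
# relaxes context lines, so this hunk cannot be applied.
cat > bad.patch <<'EOF'
--- a/src/file.txt
+++ b/src/file.txt
@@ -1,3 +1,3 @@
 alpha
-BETA
+delta
 gamma
EOF

cd tree

# --dry-run reports success or failure without modifying the tree.
if patch -p1 -s --dry-run < ../bad.patch >/dev/null 2>&1; then
    dry_run=clean
else
    dry_run=rejects
fi

# The real run exits non-zero and leaves a reject file next to the target.
status=0
patch -p1 -s < ../bad.patch >/dev/null 2>&1 || status=$?

echo "dry run: $dry_run, exit status: $status"
find . -name '*.rej'
```

Checking with --dry-run before a real run is a cheap way to find out whether a patch will apply cleanly before any file in the tree is modified.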
If you don't have any third-party patches applied to your kernel source, but only patches from kernel.org and you apply the patches in the correct order, and have made no modifications yourself to the source files, then you should never see a fuzz or reject message from patch. If you do see such messages anyway, then there's a high risk that either your local source tree or the patch file is corrupted in some way. In that case you should probably try re-downloading the patch and if things are still not OK then you'd be advised to start with a fresh tree downloaded in full from kernel.org. Let's look a bit more at some of the messages patch can produce. If patch stops and presents a "File to patch:" prompt, then patch could not find a file to be patched. Most likely you forgot to specify -p1 or you are in the wrong directory. Less often, you'll find patches that need to be applied with -p0 instead of -p1 (reading the patch file should reveal if this is the case -- if so, then this is an error by the person who created the patch but is not fatal). If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a message similar to that, then it means that patch had to adjust the location of the change (in this example it needed to move 7 lines from where it expected to make the change to make it fit). The resulting file may or may not be OK, depending on the reason the file was different than expected. This often happens if you try to apply a patch that was generated against a different kernel version than the one you are trying to patch. If you get a message like "Hunk #3 FAILED at 2387.", then it means that the patch could not be applied correctly and the patch program was unable to fuzz its way through. This will generate a .rej file with the change that caused the patch to fail and also a .orig file showing you the original content that couldn't be changed. If you get "Reversed (or previously applied) patch detected! Assume -R? 
[n]" then patch detected that the change contained in the patch seems to have already been made. If you actually did apply this patch previously and you just re-applied it in error, then just say [n]o and abort this patch. If you applied this patch previously and actually intended to revert it, but forgot to specify -R, then you can say [y]es here to make patch revert it for you. This can also happen if the creator of the patch reversed the source and destination directories when creating the patch, and in that case reverting the patch will in fact apply it. A message similar to "patch: **** unexpected end of file in patch" or "patch unexpectedly ends in middle of line" means that patch could make no sense of the file you fed to it. Either your download is broken, you tried to feed patch a compressed patch file without uncompressing it first, or the patch file that you are using has been mangled by a mail client or mail transfer agent along the way somewhere, e.g., by splitting a long line into two lines. Often these warnings can easily be fixed by joining (concatenating) the two lines that had been split. As I already mentioned above, these errors should never happen if you apply a patch from kernel.org to the correct version of an unmodified source tree. So if you get these errors with kernel.org patches then you should probably assume that either your patch file or your tree is broken and I'd advise you to start over with a fresh download of a full kernel tree and the patch you wish to apply. Are there any alternatives to `patch'? --- Yes there are alternatives. You can use the `interdiff' program (http://cyberelk.net/tim/patchutils/) to generate a patch representing the differences between two patches and then apply the result. This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single step. 
The -z flag to interdiff will even let you feed it patches in gzip or bzip2 compressed form directly without the use of zcat or bzcat or manual decompression. Here's how you'd go from 2.6.12.2 to 2.6.12.3 in a single step: interdiff -z ../patch-2.6.12.2.bz2 ../patch-2.6.12.3.gz | patch -p1 Although interdiff may save you a step or two you are generally advised to do the additional steps since interdiff can get things wrong in some cases. Another alternative is `ketchup', which is a python script for automatic downloading and applying of patches (http://www.selenic.com/ketchup/). Other nice tools are diffstat, which shows a summary of changes made by a patch; lsdiff, which displays a short listing of affected files in a patch file, along with (optionally) the line numbers of the start of each patch; and grepdiff, which displays a list of the files modified by a patch where the patch contains a given regular expression. Where can I download the patches? --- The patches are available at http://kernel.org/ Most recent patches are linked from the front page, but they also have specific homes. The 2.6.x.y (-stable) and 2.6.x patches live at ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ The -rc patches live at ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/ The -git patches live at ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots/ The -mm kernels live at ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/ In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a country code. This way you'll be downloading from a mirror site that's most likely geographically closer to you, resulting in faster downloads for you, less bandwidth used globally and less load on the main kernel.org servers -- these are good things, so do use mirrors when possible. The 2.6.x kernels --- These are the base stable releases released by Linus. The highest numbered release is the most recent. 
If regressions or other serious flaws are found, then a -stable fix patch will be released (see below) on top of this base. Once a new 2.6.x base kernel is released, a patch is made available that is a delta between the previous 2.6.x kernel and the new one. To apply a patch moving from 2.6.11 to 2.6.12, you'd do the following (note that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the base 2.6.x kernel -- if you need to move from 2.6.x.y to 2.6.x+1 you need to first revert the 2.6.x.y patch). Here are some examples: # moving from 2.6.11 to 2.6.12 $ cd ~/linux-2.6.11 # change to kernel source dir $ patch -p1 < ../patch-2.6.12 # apply the 2.6.12 patch $ cd .. $ mv linux-2.6.11 linux-2.6.12 # rename source dir # moving from 2.6.11.1 to 2.6.12 $ cd ~/linux-2.6.11.1 # change to kernel source dir $ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch # source dir is now 2.6.11 $ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch $ cd .. $ mv linux-2.6.11.1 linux-2.6.12 # rename source dir The 2.6.x.y kernels --- Kernels with 4-digit versions are -stable kernels. They contain small(ish) critical fixes for security problems or significant regressions discovered in a given 2.6.x kernel. This is the recommended branch for users who want the most recent stable kernel and are not interested in helping test development/experimental versions. If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is the current stable kernel. note: the -stable team usually do make incremental patches available as well as patches against the latest mainline release, but I only cover the non-incremental ones below. The incremental ones can be found at ftp://ftp.kernel.org/pub/linux/kernel/v2.6/incr/ These patches are not incremental, meaning that for example the 2.6.12.3 patch does not apply on top of the 2.6.12.2 kernel source, but rather on top of the base 2.6.12 kernel source . 
So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel source you have to first back out the 2.6.12.2 patch (so you are left with a base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch. Here's a small example: $ cd ~/linux-2.6.12.2 # change into the kernel source dir $ patch -p1 -R < ../patch-2.6.12.2 # revert the 2.6.12.2 patch $ patch -p1 < ../patch-2.6.12.3 # apply the new 2.6.12.3 patch $ cd .. $ mv linux-2.6.12.2 linux-2.6.12.3 # rename the kernel source dir The -rc kernels --- These are release-candidate kernels. These are development kernels released by Linus whenever he deems the current git (the kernel's source management tool) tree to be in a reasonably sane state adequate for testing. These kernels are not stable and you should expect occasional breakage if you intend to run them. This is however the most stable of the main development branches and is also what will eventually turn into the next stable kernel, so it is important that it be tested by as many people as possible. This is a good branch to run for people who want to help out testing development kernels but do not want to run some of the really experimental stuff (such people should see the sections about -git and -mm kernels below). The -rc patches are not incremental, they apply to a base 2.6.x kernel, just like the 2.6.x.y patches described above. The kernel version before the -rcN suffix denotes the version of the kernel that this -rc kernel will eventually turn into. So, 2.6.13-rc5 means that this is the fifth release candidate for the 2.6.13 kernel and the patch should be applied on top of the 2.6.12 kernel source. Here are 3 examples of how to apply these patches: # first an example of moving from 2.6.12 to 2.6.13-rc3 $ cd ~/linux-2.6.12 # change into the 2.6.12 source dir $ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch $ cd .. 
$ mv linux-2.6.12 linux-2.6.13-rc3 # rename the source dir # now let's move from 2.6.13-rc3 to 2.6.13-rc5 $ cd ~/linux-2.6.13-rc3 # change into the 2.6.13-rc3 dir $ patch -p1 -R < ../patch-2.6.13-rc3 # revert the 2.6.13-rc3 patch $ patch -p1 < ../patch-2.6.13-rc5 # apply the new 2.6.13-rc5 patch $ cd .. $ mv linux-2.6.13-rc3 linux-2.6.13-rc5 # rename the source dir # finally let's try and move from 2.6.12.3 to 2.6.13-rc5 $ cd ~/linux-2.6.12.3 # change to the kernel source dir $ patch -p1 -R < ../patch-2.6.12.3 # revert the 2.6.12.3 patch $ patch -p1 < ../patch-2.6.13-rc5 # apply new 2.6.13-rc5 patch $ cd .. $ mv linux-2.6.12.3 linux-2.6.13-rc5 # rename the kernel source dir The -git kernels --- These are daily snapshots of Linus' kernel tree (managed in a git repository, hence the name). These patches are usually released daily and represent the current state of Linus's tree. They are more experimental than -rc kernels since they are generated automatically without even a cursory glance to see if they are sane. -git patches are not incremental and apply either to a base 2.6.x kernel or a base 2.6.x-rc kernel -- you can see which from their name. A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel. Here are some examples of how to apply these patches: # moving from 2.6.12 to 2.6.12-git1 $ cd ~/linux-2.6.12 # change to the kernel source dir $ patch -p1 < ../patch-2.6.12-git1 # apply the 2.6.12-git1 patch $ cd .. 
$ mv linux-2.6.12 linux-2.6.12-git1 # rename the kernel source dir # moving from 2.6.12-git1 to 2.6.13-rc2-git3 $ cd ~/linux-2.6.12-git1 # change to the kernel source dir $ patch -p1 -R < ../patch-2.6.12-git1 # revert the 2.6.12-git1 patch # we now have a 2.6.12 kernel $ patch -p1 < ../patch-2.6.13-rc2 # apply the 2.6.13-rc2 patch # the kernel is now 2.6.13-rc2 $ patch -p1 < ../patch-2.6.13-rc2-git3 # apply the 2.6.13-rc2-git3 patch # the kernel is now 2.6.13-rc2-git3 $ cd .. $ mv linux-2.6.12-git1 linux-2.6.13-rc2-git3 # rename source dir The -mm kernels --- These are experimental kernels released by Andrew Morton. The -mm tree serves as a sort of proving ground for new features and other experimental patches. Once a patch has proved its worth in -mm for a while Andrew pushes it on to Linus for inclusion in mainline. Although it's encouraged that patches flow to Linus via the -mm tree, this is not always enforced. Subsystem maintainers (or individuals) sometimes push their patches directly to Linus, even though (or after) they have been merged and tested in -mm (or sometimes even without prior testing in -mm). You should generally strive to get your patches into mainline via -mm to ensure maximum testing. This branch is in constant flux and contains many experimental features, a lot of debugging patches not appropriate for mainline etc., and is the most experimental of the branches described in this document. These kernels are not appropriate for use on systems that are supposed to be stable and they are more risky to run than any of the other branches (make sure you have up-to-date backups -- that goes for any experimental kernel but even more so for -mm kernels). These kernels in addition to all the other experimental patches they contain usually also contain any changes in the mainline -git kernels available at the time of release. 
Testing of -mm kernels is greatly appreciated since the whole point of the tree is to weed out regressions, crashes, data corruption bugs, build breakage (and any other bug in general) before changes are merged into the more stable mainline Linus tree. But testers of -mm should be aware that breakage in this tree is more common than in any other tree.

The -mm kernels are not released on a fixed schedule, but usually a few -mm kernels are released in between each -rc kernel (1 to 3 is common). The -mm kernels apply to either a base 2.6.x kernel (when no -rc kernels have been released yet) or to a Linus -rc kernel.

Here are some examples of applying the -mm patches:

# moving from 2.6.12 to 2.6.12-mm1
$ cd ~/linux-2.6.12			# change to the 2.6.12 source dir
$ patch -p1 < ../2.6.12-mm1		# apply the 2.6.12-mm1 patch
$ cd ..
$ mv linux-2.6.12 linux-2.6.12-mm1	# rename the source appropriately

# moving from 2.6.12-mm1 to 2.6.13-rc3-mm3
$ cd ~/linux-2.6.12-mm1
$ patch -p1 -R < ../2.6.12-mm1		# revert the 2.6.12-mm1 patch
					# we now have a 2.6.12 source
$ patch -p1 < ../patch-2.6.13-rc3	# apply the 2.6.13-rc3 patch
					# we now have a 2.6.13-rc3 source
$ patch -p1 < ../2.6.13-rc3-mm3		# apply the 2.6.13-rc3-mm3 patch
$ cd ..
$ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3	# rename the source dir

This concludes this list of explanations of the various kernel trees. I hope you are now clear on how to apply the various patches and help test the kernel.

Thanks to Randy Dunlap, Rolf Eike Beer, Linus Torvalds, Bodo Eggert, Johannes Stezenbach, Grant Coady, Pavel Machek and others that I may have forgotten for their reviews and contributions to this document.


		Semantics and Behavior of Atomic and
		         Bitmask Operations

			  David S. Miller

This document is intended to serve as a guide to Linux port maintainers on how to implement atomic counter, bitops, and spinlock interfaces properly.

The atomic_t type should be defined as a signed integer.
Also, it should be made opaque such that any kind of cast to a normal C integer type will fail. Something like the following should suffice: typedef struct { int counter; } atomic_t; Historically, counter has been declared volatile. This is now discouraged. See Documentation/volatile-considered-harmful.txt for the complete rationale. local_t is very similar to atomic_t. If the counter is per CPU and only updated by one CPU, local_t is probably more appropriate. Please see Documentation/local_ops.txt for the semantics of local_t. The first operations to implement for atomic_t's are the initializers and plain reads. #define ATOMIC_INIT(i) { (i) } #define atomic_set(v, i) ((v)->counter = (i)) The first macro is used in definitions, such as: static atomic_t my_counter = ATOMIC_INIT(1); The initializer is atomic in that the return values of the atomic operations are guaranteed to be correct reflecting the initialized value if the initializer is used before runtime. If the initializer is used at runtime, a proper implicit or explicit read memory barrier is needed before reading the value with atomic_read from another thread. The second interface can be used at runtime, as in: struct foo { atomic_t counter; }; ... struct foo *k; k = kmalloc(sizeof(*k), GFP_KERNEL); if (!k) return -ENOMEM; atomic_set(&k->counter, 0); The setting is atomic in that the return values of the atomic operations by all threads are guaranteed to be correct reflecting either the value that has been set with this operation or set with another operation. A proper implicit or explicit memory barrier is needed before the value set with the operation is guaranteed to be readable with atomic_read from another thread. Next, we have: #define atomic_read(v) ((v)->counter) which simply reads the counter value currently visible to the calling thread. 
The read is atomic in that the return value is guaranteed to be one of the values initialized or modified with the interface operations if a proper implicit or explicit memory barrier is used after possible runtime initialization by any other thread and the value is modified only with the interface operations. atomic_read does not guarantee that the runtime initialization by any other thread is visible yet, so the user of the interface must take care of that with a proper implicit or explicit memory barrier. *** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! *** Some architectures may choose to use the volatile keyword, barriers, or inline assembly to guarantee some degree of immediacy for atomic_read() and atomic_set(). This is not uniformly guaranteed, and may change in the future, so all users of atomic_t should treat atomic_read() and atomic_set() as simple C statements that may be reordered or optimized away entirely by the compiler or processor, and explicitly invoke the appropriate compiler and/or memory barrier for each use case. Failure to do so will result in code that may suddenly break when used with different architectures or compiler optimizations, or even changes in unrelated code which changes how the compiler optimizes the section accessing atomic_t variables. *** YOU HAVE BEEN WARNED! *** Properly aligned pointers, longs, ints, and chars (and unsigned equivalents) may be atomically loaded from and stored to in the same sense as described for atomic_read() and atomic_set(). The ACCESS_ONCE() macro should be used to prevent the compiler from using optimizations that might otherwise optimize accesses out of existence on the one hand, or that might create unsolicited accesses on the other. 
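For orientation, here is a minimal user-space sketch of how such a macro can be written: a cast through a volatile-qualified pointer, which obliges the compiler to emit exactly one access per use. It assumes a GCC-compatible compiler (__typeof__ is a GNU extension). The variable shared_flag and the helper poll_while_positive() are hypothetical and exist only to make the sketch self-contained. As with the real macro, this constrains the compiler only and implies no memory barrier.

```c
/* Sketch only: access x exactly once through a volatile-qualified lvalue.
 * Assumes a GCC-compatible compiler (__typeof__ is a GNU extension). */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int shared_flag;	/* hypothetical variable shared with another context */

/* Hypothetical helper: loop while shared_flag is positive, at most
 * max_iter times, and return how many times the loop body ran.
 * Every evaluation of the loop condition re-reads shared_flag. */
static int poll_while_positive(int max_iter)
{
	int n = 0;

	while (ACCESS_ONCE(shared_flag) > 0 && n < max_iter) {
		n++;
		if (n == 3)
			ACCESS_ONCE(shared_flag) = 0;	/* stand-in for a concurrent writer */
	}
	return n;
}
```

The store on the third iteration only imitates another context clearing the flag, so that the sketch terminates on its own in a single thread.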
For example consider the following code:

	while (a > 0)
		do_something();

If the compiler can prove that do_something() does not store to the variable a, then the compiler is within its rights transforming this to the following:

	tmp = a;
	if (tmp > 0)
		for (;;)
			do_something();

If you don't want the compiler to do this (and you probably don't), then you should use something like the following:

	while (ACCESS_ONCE(a) > 0)
		do_something();

Alternatively, you could place a barrier() call in the loop.

For another example, consider the following code:

	tmp_a = a;
	do_something_with(tmp_a);
	do_something_else_with(tmp_a);

If the compiler can prove that do_something_with() does not store to the variable a, then the compiler is within its rights to manufacture an additional load as follows:

	tmp_a = a;
	do_something_with(tmp_a);
	tmp_a = a;
	do_something_else_with(tmp_a);

This could fatally confuse your code if it expected the same value to be passed to do_something_with() and do_something_else_with().

The compiler would be likely to manufacture this additional load if do_something_with() was an inline function that made very heavy use of registers: reloading from variable a could save a flush to the stack and later reload. To prevent the compiler from attacking your code in this manner, write the following:

	tmp_a = ACCESS_ONCE(a);
	do_something_with(tmp_a);
	do_something_else_with(tmp_a);

For a final example, consider the following code, assuming that the variable a is set at boot time before the second CPU is brought online and never changed later, so that memory barriers are not needed:

	if (a)
		b = 9;
	else
		b = 42;

The compiler is within its rights to manufacture an additional store by transforming the above code into the following:

	b = 42;
	if (a)
		b = 9;

This could come as a fatal surprise to other code running concurrently that expected b to never have the value 42 if a was non-zero.
To prevent the compiler from doing this, write something like:

	if (a)
		ACCESS_ONCE(b) = 9;
	else
		ACCESS_ONCE(b) = 42;

Don't even -think- about doing this without proper use of memory barriers, locks, or atomic operations if variable a can change at runtime!

*** WARNING: ACCESS_ONCE() DOES NOT IMPLY A BARRIER! ***

Now, we move onto the atomic operation interfaces typically implemented with the help of assembly code.

	void atomic_add(int i, atomic_t *v);
	void atomic_sub(int i, atomic_t *v);
	void atomic_inc(atomic_t *v);
	void atomic_dec(atomic_t *v);

These four routines add and subtract integral values to/from the given atomic_t value. The first two routines pass explicit integers by which to make the adjustment, whereas the latter two use an implicit adjustment value of "1".

One very important aspect of these four routines is that they DO NOT require any explicit memory barriers. They need only perform the atomic_t counter update in an SMP safe manner.

Next, we have:

	int atomic_inc_return(atomic_t *v);
	int atomic_dec_return(atomic_t *v);

These routines add 1 and subtract 1, respectively, from the given atomic_t and return the new counter value after the operation is performed.

Unlike the above routines, it is required that explicit memory barriers are performed before and after the operation. It must be done such that all memory operations before and after the atomic operation calls are strongly ordered with respect to the atomic operation itself.

For example, it should behave as if a smp_mb() call existed both before and after the atomic operation.

If the atomic instructions used in an implementation provide explicit memory barrier semantics which satisfy the above requirements, that is fine as well.

Let's move on:

	int atomic_add_return(int i, atomic_t *v);
	int atomic_sub_return(int i, atomic_t *v);

These behave just like atomic_{inc,dec}_return() except that an explicit counter adjustment is given instead of the implicit "1".
This means that like atomic_{inc,dec}_return(), the memory barrier semantics are required.

Next:

	int atomic_inc_and_test(atomic_t *v);
	int atomic_dec_and_test(atomic_t *v);

These two routines increment and decrement by 1, respectively, the given atomic counter. They return a boolean indicating whether the resulting counter value was zero or not.

They require explicit memory barrier semantics around the operation as above.

	int atomic_sub_and_test(int i, atomic_t *v);

This is identical to atomic_dec_and_test() except that an explicit decrement is given instead of the implicit "1". It requires explicit memory barrier semantics around the operation.

	int atomic_add_negative(int i, atomic_t *v);

The given increment is added to the given atomic counter value. A boolean is returned which indicates whether the resulting counter value is negative. It requires explicit memory barrier semantics around the operation.

Then:

	int atomic_xchg(atomic_t *v, int new);

This performs an atomic exchange operation on the atomic variable v, setting the given new value. It returns the old value that the atomic variable v had just before the operation.

	int atomic_cmpxchg(atomic_t *v, int old, int new);

This performs an atomic compare exchange operation on the atomic value v, with the given old and new values. Like all atomic_xxx operations, atomic_cmpxchg will only satisfy its atomicity semantics as long as all other accesses of *v are performed through atomic_xxx operations.

atomic_cmpxchg requires explicit memory barriers around the operation.

The semantics for atomic_cmpxchg are the same as those defined for 'cas' below.

Finally:

	int atomic_add_unless(atomic_t *v, int a, int u);

If the atomic value v is not equal to u, this function adds a to v, and returns non-zero. If v is equal to u then it returns zero. This is done as an atomic operation.

atomic_add_unless requires explicit memory barriers around the operation unless it fails (returns 0).
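The behaviour of atomic_add_unless() described above can be illustrated with a user-space sketch built on C11 <stdatomic.h>. This is not the kernel implementation: my_atomic_t, my_atomic_add_unless() and my_atomic_inc_not_zero() are hypothetical names, and memory_order_seq_cst on the successful update is only a rough user-space stand-in for the barriers the kernel interface requires around a successful operation.

```c
#include <stdatomic.h>

/* User-space stand-in for the kernel's atomic_t; not the kernel type. */
typedef struct { atomic_int counter; } my_atomic_t;

/* Add a to v unless v == u. Returns non-zero if the add was performed. */
static int my_atomic_add_unless(my_atomic_t *v, int a, int u)
{
	int old = atomic_load_explicit(&v->counter, memory_order_relaxed);

	while (old != u) {
		/* On failure, old is refreshed with the value currently in v. */
		if (atomic_compare_exchange_weak_explicit(&v->counter, &old,
							  old + a,
							  memory_order_seq_cst,
							  memory_order_relaxed))
			return 1;
	}
	return 0;	/* v == u: nothing written, no ordering promised */
}

/* Take a reference only if the count has not already reached zero. */
#define my_atomic_inc_not_zero(v) my_atomic_add_unless((v), 1, 0)
```

The compare-and-exchange loop retries whenever another thread changed the counter between the load and the update, which is what makes the "test, then add" pair behave as a single atomic step.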
atomic_inc_not_zero(v) is equivalent to atomic_add_unless(v, 1, 0).

If a caller requires memory barrier semantics around an atomic_t operation which does not return a value, a set of interfaces is defined which accomplishes this:

	void smp_mb__before_atomic_dec(void);
	void smp_mb__after_atomic_dec(void);
	void smp_mb__before_atomic_inc(void);
	void smp_mb__after_atomic_inc(void);

For example, smp_mb__before_atomic_dec() can be used like so:

	obj->dead = 1;
	smp_mb__before_atomic_dec();
	atomic_dec(&obj->ref_count);

It makes sure that all memory operations preceding the atomic_dec() call are strongly ordered with respect to the atomic counter operation. In the above example, it guarantees that the assignment of "1" to obj->dead will be globally visible to other cpus before the atomic counter decrement.

Without the explicit smp_mb__before_atomic_dec() call, the implementation could legally allow the atomic counter update to become visible to other cpus before the "obj->dead = 1;" assignment.

The other three interfaces listed are used to provide explicit ordering with respect to memory operations after an atomic_dec() call (smp_mb__after_atomic_dec()) and around atomic_inc() calls (smp_mb__{before,after}_atomic_inc()).

A missing memory barrier in the cases where they are required by the atomic_t implementation above can have disastrous results. Here is an example, which follows a pattern occurring frequently in the Linux kernel.
It is the use of atomic counters to implement reference counting, and it works such that once the counter falls to zero it can be guaranteed that no other entity can be accessing the object:

	static void obj_list_add(struct obj *obj, struct list_head *head)
	{
		obj->active = 1;
		list_add(&obj->list, head);
	}

	static void obj_list_del(struct obj *obj)
	{
		list_del(&obj->list);
		obj->active = 0;
	}

	static void obj_destroy(struct obj *obj)
	{
		BUG_ON(obj->active);
		kfree(obj);
	}

	struct obj *obj_list_peek(struct list_head *head)
	{
		if (!list_empty(head)) {
			struct obj *obj;

			obj = list_entry(head->next, struct obj, list);
			atomic_inc(&obj->refcnt);
			return obj;
		}
		return NULL;
	}

	void obj_poke(void)
	{
		struct obj *obj;

		spin_lock(&global_list_lock);
		obj = obj_list_peek(&global_list);
		spin_unlock(&global_list_lock);

		if (obj) {
			obj->ops->poke(obj);
			if (atomic_dec_and_test(&obj->refcnt))
				obj_destroy(obj);
		}
	}

	void obj_timeout(struct obj *obj)
	{
		spin_lock(&global_list_lock);
		obj_list_del(obj);
		spin_unlock(&global_list_lock);

		if (atomic_dec_and_test(&obj->refcnt))
			obj_destroy(obj);
	}

(This is a simplification of the ARP queue management in the generic neighbour discovery code of the networking stack. Olaf Kirch found a bug wrt. memory barriers in kfree_skb() that exposed the atomic_t memory barrier requirements quite clearly.)

Given the above scheme, it must be the case that the obj->active update done by the obj list deletion be visible to other processors before the atomic counter decrement is performed.

Otherwise, the counter could fall to zero, yet obj->active would still be set, thus triggering the assertion in obj_destroy(). The error sequence looks like this:

	cpu 0				cpu 1
	obj_poke()			obj_timeout()
	obj = obj_list_peek();
	... gains ref to obj, refcnt=2
					obj_list_del(obj);
					obj->active = 0 ...
					... visibility delayed ...
					atomic_dec_and_test()
					... refcnt drops to 1 ...
	atomic_dec_and_test()
	... refcount drops to 0 ...
	obj_destroy()
	BUG() triggers since obj->active
	still seen as one
					obj->active update visibility occurs

With the memory barrier semantics required of the atomic_t operations which return values, the above sequence of memory visibility can never happen. Specifically, in the above case the atomic_dec_and_test() counter decrement would not become globally visible until the obj->active update does.

As a historical note, 32-bit Sparc used to only allow usage of 24 bits of its atomic_t type. This was because it used 8 bits as a spinlock for SMP safety. Sparc32 lacked a "compare and swap" type instruction. However, 32-bit Sparc has since been moved over to a "hash table of spinlocks" scheme that allows the full 32-bit counter to be realized. Essentially, an array of spinlocks is indexed into based upon the address of the atomic_t being operated on, and that lock protects the atomic operation. Parisc uses the same scheme.

Another note is that the atomic_t operations returning values are extremely slow on an old 386.

We will now cover the atomic bitmask operations. You will find that their SMP and memory barrier semantics are similar in shape and scope to the atomic_t ops above.

Native atomic bit operations are defined to operate on objects aligned to the size of an "unsigned long" C data type, and are at least of that size. The endianness of the bits within each "unsigned long" is the native endianness of the cpu.

	void set_bit(unsigned long nr, volatile unsigned long *addr);
	void clear_bit(unsigned long nr, volatile unsigned long *addr);
	void change_bit(unsigned long nr, volatile unsigned long *addr);

These routines set, clear, and change, respectively, the bit number indicated by "nr" on the bit mask pointed to by "addr".

They must execute atomically, yet there are no implicit memory barrier semantics required of these interfaces.
	int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

Like the above, except that these routines return a boolean which indicates whether the changed bit was set _BEFORE_ the atomic bit operation.

WARNING! It is incredibly important that the value be a boolean, i.e. "0" or "1". Do not try to be fancy and save a few instructions by declaring the above to return "long" and just returning something like "old_val & mask" because that will not work.

For one thing, this return value gets truncated to int in many code paths using these interfaces, so on 64-bit if the bit is set in the upper 32 bits then testers will never see that.

One great example of where this problem crops up is the thread_info flag operations. Routines such as test_and_set_ti_thread_flag() chop the return value into an int. There are other places where things like this occur as well.

These routines, like the atomic_t counter operations returning values, require explicit memory barrier semantics around their execution. All memory operations before the atomic bit operation call must be made visible globally before the atomic bit operation is made visible. Likewise, the atomic bit operation must be visible globally before any subsequent memory operation is made visible. For example:

	obj->dead = 1;
	if (test_and_set_bit(0, &obj->flags))
		/* ... */;
	obj->killed = 1;

The implementation of test_and_set_bit() must guarantee that "obj->dead = 1;" is visible to cpus before the atomic memory operation done by test_and_set_bit() becomes visible. Likewise, the atomic memory operation done by test_and_set_bit() must become visible before "obj->killed = 1;" is visible.
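The "return a real boolean" rule can be made concrete with a user-space sketch using C11 <stdatomic.h>. It is an illustration under assumptions, not the kernel code: my_test_and_set_bit() and MY_BITS_PER_LONG are hypothetical names, the bitmap is declared as an array of atomic_ulong rather than "volatile unsigned long", and the default sequentially consistent ordering of atomic_fetch_or() is used as an approximation of the required barrier semantics.

```c
#include <limits.h>
#include <stdatomic.h>

#define MY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Sketch of test_and_set_bit() on a user-space bitmap of atomic words. */
static int my_test_and_set_bit(unsigned long nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << (nr % MY_BITS_PER_LONG);
	atomic_ulong *word = addr + nr / MY_BITS_PER_LONG;
	/* Sequentially consistent read-modify-write returning the old word. */
	unsigned long old = atomic_fetch_or(word, mask);

	/* Normalise to 0 or 1. Returning (old & mask) through an int would
	 * lose the result for bits in the upper half of a 64-bit long. */
	return (old & mask) != 0;
}
```

Testing the highest bit of a word is the interesting case: with a 64-bit long, "old & mask" is 1UL << 63, which does not survive conversion to int, whereas the comparison always yields 0 or 1.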
Finally there is the basic operation: int test_bit(unsigned long nr, __const__ volatile unsigned long *addr); Which returns a boolean indicating if bit "nr" is set in the bitmask pointed to by "addr". If explicit memory barriers are required around clear_bit() (which does not return a value, and thus does not need to provide memory barrier semantics), two interfaces are provided: void smp_mb__before_clear_bit(void); void smp_mb__after_clear_bit(void); They are used as follows, and are akin to their atomic_t operation brothers: /* All memory operations before this call will * be globally visible before the clear_bit(). */ smp_mb__before_clear_bit(); clear_bit( ... ); /* The clear_bit() will be visible before all * subsequent memory operations. */ smp_mb__after_clear_bit(); There are two special bitops with lock barrier semantics (acquire/release, same as spinlocks). These operate in the same way as their non-_lock/unlock postfixed variants, except that they are to provide acquire/release semantics, respectively. This means they can be used for bit_spin_trylock and bit_spin_unlock type operations without specifying any more barriers. int test_and_set_bit_lock(unsigned long nr, unsigned long *addr); void clear_bit_unlock(unsigned long nr, unsigned long *addr); void __clear_bit_unlock(unsigned long nr, unsigned long *addr); The __clear_bit_unlock version is non-atomic, however it still implements unlock barrier semantics. This can be useful if the lock itself is protecting the other bits in the word. Finally, there are non-atomic versions of the bitmask operations provided. They are used in contexts where some other higher-level SMP locking scheme is being used to protect the bitmask, and thus less expensive non-atomic operations may be used in the implementation. They have names similar to the above bitmask operation interfaces, except that two underscores are prefixed to the interface name. 
	void __set_bit(unsigned long nr, volatile unsigned long *addr);
	void __clear_bit(unsigned long nr, volatile unsigned long *addr);
	void __change_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

These non-atomic variants also do not require any special memory barrier semantics.

The routines xchg() and cmpxchg() need exactly the same memory barriers as the atomic and bit operations returning values.

Spinlocks and rwlocks have memory barrier expectations as well. The rule to follow is simple:

1) When acquiring a lock, the implementation must make it globally visible before any subsequent memory operation.

2) When releasing a lock, the implementation must make it such that all previous memory operations are globally visible before the lock release.

Which finally brings us to _atomic_dec_and_lock(). There is an architecture-neutral version implemented in lib/dec_and_lock.c, but most platforms will wish to optimize this in assembler.

	int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock);

Atomically decrement the given counter, and if it will drop to zero, atomically acquire the given spinlock and perform the decrement of the counter to zero. If it does not drop to zero, do nothing with the spinlock.

It is actually pretty simple to get the memory barrier correct. Simply satisfy the spinlock grab requirements, which is to make sure the spinlock operation is globally visible before any subsequent memory operation.

We can demonstrate this operation more clearly if we define an abstract atomic operation:

	long cas(long *mem, long old, long new);

"cas" stands for "compare and swap". It atomically:

1) Compares "old" with the value currently at "mem".
2) If they are equal, "new" is written to "mem".
3) Regardless, the value found at "mem" in step 1 is returned.
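For readers who want to experiment with the abstract cas() in user space, here is one possible sketch built on C11 <stdatomic.h>. It is an illustration under assumptions, not how any architecture implements it: the pointer is declared "_Atomic long *" instead of plain "long *", and the third parameter is called new_val.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Sketch of the abstract cas() using C11 atomics.
 * Returns the value that was found at "mem", whether or not the
 * store of "new_val" took place.
 */
long cas(_Atomic long *mem, long old, long new_val)
{
	long expected = old;

	/*
	 * On success "expected" still equals "old", which is what was
	 * in memory. On failure the library writes the value actually
	 * found at "mem" into "expected".
	 */
	atomic_compare_exchange_strong(mem, &expected, new_val);
	return expected;
}
```

A caller detects success the same way the pseudo-C in this document does, by checking that the returned value equals the "old" it passed in.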
As an example usage, here is what an atomic counter update might look like: void example_atomic_inc(long *counter) { long old, new, ret; while (1) { old = *counter; new = old + 1; ret = cas(counter, old, new); if (ret == old) break; } } Let's use cas() in order to build a pseudo-C atomic_dec_and_lock(): int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) { long old, new, ret; int went_to_zero; went_to_zero = 0; while (1) { old = atomic_read(atomic); new = old - 1; if (new == 0) { went_to_zero = 1; spin_lock(lock); } ret = cas(atomic, old, new); if (ret == old) break; if (went_to_zero) { spin_unlock(lock); went_to_zero = 0; } } return went_to_zero; } Now, as far as memory barriers go, as long as spin_lock() strictly orders all subsequent memory operations (including the cas()) with respect to itself, things will be fine. Said another way, _atomic_dec_and_lock() must guarantee that a counter dropping to zero is never made visible before the spinlock being acquired. Note that this also means that for the case where the counter is not dropping to zero, there are no memory ordering requirements. March 2008 Jan-Simon Moeller, dl9pf@gmx.de How to deal with bad memory e.g. reported by memtest86+ ? ######################################################### There are three possibilities I know of: 1) Reinsert/swap the memory modules 2) Buy new modules (best!) or try to exchange the memory if you have spare-parts 3) Use BadRAM or memmap This Howto is about number 3) . BadRAM ###### BadRAM is the actively developed and available as kernel-patch here: http://rick.vanrein.org/linux/badram/ For more details see the BadRAM documentation. memmap ###### memmap is already in the kernel and usable as kernel-parameter at boot-time. Its syntax is slightly strange and you may need to calculate the values by yourself! Syntax to exclude a memory area (see kernel-parameters.txt for details): memmap=$
Example: memtest86+ reported here errors at address 0x18691458, 0x18698424 and some others. All had 0x1869xxxx in common, so I chose a pattern of 0x18690000,0xffff0000. With the numbers of the example above: memmap=64K$0x18690000 or memmap=0x10000$0x18690000 These instructions are deliberately very basic. If you want something clever, go read the real docs ;-) Please don't add more stuff, but feel free to correct my mistakes ;-) (mbligh@aracnet.com) Thanks to John Levon, Dave Hansen, et al. for help writing this. is the thing you're trying to measure. Make sure you have the correct System.map / vmlinux referenced! It is probably easiest to use "make install" for linux and hack /sbin/installkernel to copy vmlinux to /boot, in addition to vmlinuz, config, System.map, which are usually installed by default. Readprofile ----------- A recent readprofile command is needed for 2.6, such as found in util-linux 2.12a, which can be downloaded from: http://www.kernel.org/pub/linux/utils/util-linux/ Most distributions will ship it already. Add "profile=2" to the kernel command line. clear readprofile -r dump output readprofile -m /boot/System.map > captured_profile Oprofile -------- Get the source (see Changes for required version) from http://oprofile.sourceforge.net/ and add "idle=poll" to the kernel command line. Configure with CONFIG_PROFILING=y and CONFIG_OPROFILE=y & reboot on new kernel ./configure --with-kernel-support make install For superior results, be sure to enable the local APIC. If opreport sees a 0Hz CPU, APIC was not on. Be aware that idle=poll may mean a performance penalty. One time setup: opcontrol --setup --vmlinux=/boot/vmlinux clear opcontrol --reset start opcontrol --start stop opcontrol --stop dump output opreport > output_file To only report on the kernel, run opreport -l /boot/vmlinux > output_file A reset is needed to clear old statistics, which survive a reboot. 
Kernel Support for miscellaneous (your favourite) Binary Formats v1.1 ===================================================================== This Kernel feature allows you to invoke almost (for restrictions see below) every program by simply typing its name in the shell. This includes for example compiled Java(TM), Python or Emacs programs. To achieve this you must tell binfmt_misc which interpreter has to be invoked with which binary. Binfmt_misc recognises the binary-type by matching some bytes at the beginning of the file with a magic byte sequence (masking out specified bits) you have supplied. Binfmt_misc can also recognise a filename extension aka '.com' or '.exe'. First you must mount binfmt_misc: mount binfmt_misc -t binfmt_misc /proc/sys/fs/binfmt_misc To actually register a new binary type, you have to set up a string looking like :name:type:offset:magic:mask:interpreter:flags (where you can choose the ':' upon your needs) and echo it to /proc/sys/fs/binfmt_misc/register. Here is what the fields mean: - 'name' is an identifier string. A new /proc file will be created with this name below /proc/sys/fs/binfmt_misc - 'type' is the type of recognition. Give 'M' for magic and 'E' for extension. - 'offset' is the offset of the magic/mask in the file, counted in bytes. This defaults to 0 if you omit it (i.e. you write ':name:type::magic...') - 'magic' is the byte sequence binfmt_misc is matching for. The magic string may contain hex-encoded characters like \x0a or \xA4. In a shell environment you will have to write \\x0a to prevent the shell from eating your \. If you chose filename extension matching, this is the extension to be recognised (without the '.', the \x0a specials are not allowed). Extension matching is case sensitive! - 'mask' is an (optional, defaults to all 0xff) mask. You can mask out some bits from matching by supplying a string like magic and as long as magic. The mask is anded with the byte sequence of the file. 
- 'interpreter' is the program that should be invoked with the binary as first argument (specify the full path) - 'flags' is an optional field that controls several aspects of the invocation of the interpreter. It is a string of capital letters, each controls a certain aspect. The following flags are supported - 'P' - preserve-argv[0]. Legacy behavior of binfmt_misc is to overwrite the original argv[0] with the full path to the binary. When this flag is included, binfmt_misc will add an argument to the argument vector for this purpose, thus preserving the original argv[0]. 'O' - open-binary. Legacy behavior of binfmt_misc is to pass the full path of the binary to the interpreter as an argument. When this flag is included, binfmt_misc will open the file for reading and pass its descriptor as an argument, instead of the full path, thus allowing the interpreter to execute non-readable binaries. This feature should be used with care - the interpreter has to be trusted not to emit the contents of the non-readable binary. 'C' - credentials. Currently, the behavior of binfmt_misc is to calculate the credentials and security token of the new process according to the interpreter. When this flag is included, these attributes are calculated according to the binary. It also implies the 'O' flag. This feature should be used with care as the interpreter will run with root permissions when a setuid binary owned by root is run with binfmt_misc. There are some restrictions: - the whole register string may not exceed 255 characters - the magic must reside in the first 128 bytes of the file, i.e. offset+size(magic) has to be less than 128 - the interpreter string may not exceed 127 characters To use binfmt_misc you have to mount it first. You can mount it with "mount -t binfmt_misc none /proc/sys/fs/binfmt_misc" command, or you can add a line "none /proc/sys/fs/binfmt_misc binfmt_misc defaults 0 0" to your /etc/fstab so it auto mounts on boot. 
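The size restrictions listed above are easy to get wrong when a register string is assembled by a script. The following is a small sketch of a pre-check, assuming the caller already knows the relevant lengths; the function binfmt_limits_ok() and the macro names are hypothetical and are not part of binfmt_misc. The magic length is meant to be the number of bytes after \x escapes have been decoded.

```c
#include <assert.h>
#include <stddef.h>

/* Limits taken from the restrictions listed in this document. */
#define EXAMPLE_MAX_REGISTER	255	/* whole register string */
#define EXAMPLE_MAGIC_WINDOW	128	/* offset + size(magic) must be less than this */
#define EXAMPLE_MAX_INTERP	127	/* interpreter string */

/* Returns 1 if the given lengths respect all three restrictions, 0 otherwise. */
int binfmt_limits_ok(size_t register_len, size_t offset,
		     size_t magic_len, size_t interp_len)
{
	if (register_len > EXAMPLE_MAX_REGISTER)
		return 0;
	if (offset + magic_len >= EXAMPLE_MAGIC_WINDOW)
		return 0;
	if (interp_len > EXAMPLE_MAX_INTERP)
		return 0;
	return 1;
}
```

For instance, the string ':DOSWin:M::MZ::/usr/local/bin/wine:' is 35 characters long, has a 2-byte magic at offset 0 and a 19-character interpreter path, so binfmt_limits_ok(35, 0, 2, 19) returns 1.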
You may want to add the binary formats in one of your /etc/rc scripts during boot-up. Read the manual of your init program to figure out how to do this right. Think about the order of adding entries! Later added entries are matched first! A few examples (assumed you are in /proc/sys/fs/binfmt_misc): - enable support for em86 (like binfmt_em86, for Alpha AXP only): echo ':i386:M::\x7fELF\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x03:\xff\xff\xff\xff\xff\xfe\xfe\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfb\xff\xff:/bin/em86:' > register echo ':i486:M::\x7fELF\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x06:\xff\xff\xff\xff\xff\xfe\xfe\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfb\xff\xff:/bin/em86:' > register - enable support for packed DOS applications (pre-configured dosemu hdimages): echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register - enable support for Windows executables using wine: echo ':DOSWin:M::MZ::/usr/local/bin/wine:' > register For java support see Documentation/java.txt You can enable/disable binfmt_misc or one binary type by echoing 0 (to disable) or 1 (to enable) to /proc/sys/fs/binfmt_misc/status or /proc/.../the_name. Catting the file tells you the current status of binfmt_misc/the entry. You can remove one entry or all entries by echoing -1 to /proc/.../the_name or /proc/sys/fs/binfmt_misc/status. HINTS: ====== If you want to pass special arguments to your interpreter, you can write a wrapper script for it. See Documentation/java.txt for an example. Your interpreter should NOT look in the PATH for the filename; the kernel passes it the full filename (or the file descriptor) to use. Using $PATH can cause unexpected behaviour and can be a security hazard. 
There is a web page about binfmt_misc at http://www.tat.physik.uni-tuebingen.de Richard Gunther Linux Braille Console To get early boot messages on a braille device (before userspace screen readers can start), you first need to compile the support for the usual serial console (see serial-console.txt), and for braille device (in Device Drivers - Accessibility). Then you need to specify a console=brl, option on the kernel command line, the format is: console=brl,serial_options... where serial_options... are the same as described in serial-console.txt So for instance you can use console=brl,ttyS0 if the braille device is connected to the first serial port, and console=brl,ttyS0,115200 to override the baud rate to 115200, etc. By default, the braille device will just show the last kernel message (console mode). To review previous messages, press the Insert key to switch to the VT review mode. In review mode, the arrow keys permit to browse in the VT content, page up/down keys go at the top/bottom of the screen, and the home key goes back to the cursor, hence providing very basic screen reviewing facility. Sound feedback can be obtained by adding the braille_console.sound=1 kernel parameter. For simplicity, only one braille console can be enabled, other uses of console=brl,... will be discarded. Also note that it does not interfere with the console selection mechanism described in serial-console.txt For now, only the VisioBraille device is supported. Samuel Thibault =============================================================== == BT8XXGPIO driver == == == == A driver for a selfmade cheap BT8xx based PCI GPIO-card == == == == For advanced documentation, see == == http://www.bu3sch.de/btgpio.php == =============================================================== A generic digital 24-port PCI GPIO card can be built out of an ordinary Brooktree bt848, bt849, bt878 or bt879 based analog TV tuner card. The Brooktree chip is used in old analog Hauppauge WinTV PCI cards. 
You can easily find them used for low prices on the net. The bt8xx chip does have 24 digital GPIO ports. These ports are accessible via 24 pins on the SMD chip package. ============================================== == How to physically access the GPIO pins == ============================================== The are several ways to access these pins. One might unsolder the whole chip and put it on a custom PCI board, or one might only unsolder each individual GPIO pin and solder that to some tiny wire. As the chip package really is tiny there are some advanced soldering skills needed in any case. The physical pinouts are drawn in the following ASCII art. The GPIO pins are marked with G00-G23 G G G G G G G G G G G G G G G G G G 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | --------------------------------------------------------------------------- --| ^ ^ |-- --| pin 86 pin 67 |-- --| |-- --| pin 61 > |-- G18 --| |-- G19 --| |-- G20 --| |-- G21 --| |-- G22 --| pin 56 > |-- G23 --| |-- --| Brooktree 878/879 |-- --| |-- --| |-- --| |-- --| |-- --| |-- --| |-- --| |-- --| |-- --| |-- --| |-- --| |-- --| |-- --| |-- --| O |-- --| |-- --------------------------------------------------------------------------- | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ^ This is pin 1 ======================================================================= README for btmrvl driver ======================================================================= All commands are used via debugfs interface. ===================== Set/get driver configurations: Path: /debug/btmrvl/config/ gpiogap=[n] hscfgcmd These commands are used to configure the host sleep parameters. bit 8:0 -- Gap bit 16:8 -- GPIO where GPIO is the pin number of GPIO used to wake up the host. It could be any valid GPIO pin# (e.g. 0-7) or 0xff (SDIO interface wakeup will be used instead). 
where Gap is the gap in milliseconds between wakeup signal and wakeup event, or 0xff for special host sleep setting.

Usage:
	# Use SDIO interface to wake up the host and set GAP to 0x80:
	echo 0xff80 > /debug/btmrvl/config/gpiogap
	echo 1 > /debug/btmrvl/config/hscfgcmd

	# Use GPIO pin #3 to wake up the host and set GAP to 0xff:
	echo 0x03ff > /debug/btmrvl/config/gpiogap
	echo 1 > /debug/btmrvl/config/hscfgcmd

psmode=[n]
pscmd
	These commands are used to enable/disable auto sleep mode

	where the option is:
		1 -- Enable auto sleep mode
		0 -- Disable auto sleep mode

Usage:
	# Enable auto sleep mode
	echo 1 > /debug/btmrvl/config/psmode
	echo 1 > /debug/btmrvl/config/pscmd

	# Disable auto sleep mode
	echo 0 > /debug/btmrvl/config/psmode
	echo 1 > /debug/btmrvl/config/pscmd

hsmode=[n]
hscmd
	These commands are used to enable host sleep or wake up firmware

	where the option is:
		1 -- Enable host sleep
		0 -- Wake up firmware

Usage:
	# Enable host sleep
	echo 1 > /debug/btmrvl/config/hsmode
	echo 1 > /debug/btmrvl/config/hscmd

	# Wake up firmware
	echo 0 > /debug/btmrvl/config/hsmode
	echo 1 > /debug/btmrvl/config/hscmd

======================
Get driver status:

Path: /debug/btmrvl/status/

Usage:
	cat /debug/btmrvl/status/

where the args are:

curpsmode
	This command displays the current auto sleep status.

psstate
	This command displays the power save state.

hsstate
	This command displays the host sleep state.

txdnldrdy
	This command displays the value of the Tx download ready flag.
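Returning to the gpiogap parameter under "Set/get driver configurations": the two usage examples (0xff80 and 0x03ff) suggest that the GPIO number sits in the upper byte and the gap in the lower byte of a 16-bit value. The sketch below packs and unpacks a value under that reading. The function names are hypothetical and do not exist in the btmrvl driver.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a GPIO pin number (or 0xff for SDIO wakeup) and a gap into one value. */
uint16_t example_pack_gpiogap(uint8_t gpio, uint8_t gap)
{
	return (uint16_t)(((unsigned int)gpio << 8) | gap);
}

/* Extract the GPIO field (upper byte). */
uint8_t example_gpiogap_gpio(uint16_t val)
{
	return (uint8_t)(val >> 8);
}

/* Extract the gap field (lower byte). */
uint8_t example_gpiogap_gap(uint16_t val)
{
	return (uint8_t)(val & 0xff);
}
```

Under this reading, example_pack_gpiogap(0xff, 0x80) gives 0xff80 and example_pack_gpiogap(0x03, 0xff) gives 0x03ff, matching the two echo commands in the usage text.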
===================== Use hcitool to issue raw hci command, refer to hcitool manual Usage: Hcitool cmd [Parameters] Interface Control Command hcitool cmd 0x3f 0x5b 0xf5 0x01 0x00 --Enable All interface hcitool cmd 0x3f 0x5b 0xf5 0x01 0x01 --Enable Wlan interface hcitool cmd 0x3f 0x5b 0xf5 0x01 0x02 --Enable BT interface hcitool cmd 0x3f 0x5b 0xf5 0x00 0x00 --Disable All interface hcitool cmd 0x3f 0x5b 0xf5 0x00 0x01 --Disable Wlan interface hcitool cmd 0x3f 0x5b 0xf5 0x00 0x02 --Disable BT interface ======================================================================= SD8688 firmware: /lib/firmware/sd8688_helper.bin /lib/firmware/sd8688.bin The images can be downloaded from: git.infradead.org/users/dwmw2/linux-firmware.git/libertas/ [ NOTE: The virt_to_bus() and bus_to_virt() functions have been superseded by the functionality provided by the PCI DMA interface (see Documentation/DMA-API-HOWTO.txt). They continue to be documented below for historical purposes, but new code must not use them. --davidm 00/12/12 ] [ This is a mail message in response to a query on IO mapping, thus the strange format for a "document" ] The AHA-1542 is a bus-master device, and your patch makes the driver give the controller the physical address of the buffers, which is correct on x86 (because all bus master devices see the physical memory mappings directly). However, on many setups, there are actually _three_ different ways of looking at memory addresses, and in this case we actually want the third, the so-called "bus address". Essentially, the three ways of addressing memory are (this is "real memory", that is, normal RAM--see later about other details): - CPU untranslated. This is the "physical" address. Physical address 0 is what the CPU sees when it drives zeroes on the memory bus. - CPU translated address. This is the "virtual" address, and is completely internal to the CPU itself with the CPU doing the appropriate translations into "CPU untranslated". - bus address. 
This is the address of memory as seen by OTHER devices, not the CPU. Now, in theory there could be many different bus addresses, with each device seeing memory in some device-specific way, but happily most hardware designers aren't actually actively trying to make things any more complex than necessary, so you can assume that all external hardware sees the memory the same way. Now, on normal PCs the bus address is exactly the same as the physical address, and things are very simple indeed. However, they are that simple because the memory and the devices share the same address space, and that is not generally necessarily true on other PCI/ISA setups. Now, just as an example, on the PReP (PowerPC Reference Platform), the CPU sees a memory map something like this (this is from memory): 0-2 GB "real memory" 2 GB-3 GB "system IO" (inb/out and similar accesses on x86) 3 GB-4 GB "IO memory" (shared memory over the IO bus) Now, that looks simple enough. However, when you look at the same thing from the viewpoint of the devices, you have the reverse, and the physical memory address 0 actually shows up as address 2 GB for any IO master. So when the CPU wants any bus master to write to physical memory 0, it has to give the master address 0x80000000 as the memory address. So, for example, depending on how the kernel is actually mapped on the PPC, you can end up with a setup like this: physical address: 0 virtual address: 0xC0000000 bus address: 0x80000000 where all the addresses actually point to the same thing. It's just seen through different translations.. Similarly, on the Alpha, the normal translation is physical address: 0 virtual address: 0xfffffc0000000000 bus address: 0x40000000 (but there are also Alphas where the physical address and the bus address are the same). 
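The two layouts above can be modelled as fixed offsets from the physical address. This is only a user-space sketch for checking the arithmetic, assuming simple linear translations: the struct and the model_*() helpers are hypothetical and are not the kernel's virt_to_bus()/bus_to_virt().

```c
#include <assert.h>
#include <stdint.h>

/* A platform modelled as two constant offsets from the physical address. */
struct addr_model {
	uint64_t virt_offset;	/* virtual address = physical + virt_offset */
	uint64_t bus_offset;	/* bus address     = physical + bus_offset  */
};

/* Offsets taken from the PReP and Alpha examples in the text. */
const struct addr_model prep_model  = { UINT64_C(0xC0000000), UINT64_C(0x80000000) };
const struct addr_model alpha_model = { UINT64_C(0xfffffc0000000000), UINT64_C(0x40000000) };

uint64_t model_virt_to_phys(const struct addr_model *m, uint64_t virt)
{
	return virt - m->virt_offset;
}

uint64_t model_virt_to_bus(const struct addr_model *m, uint64_t virt)
{
	return model_virt_to_phys(m, virt) + m->bus_offset;
}

uint64_t model_bus_to_virt(const struct addr_model *m, uint64_t bus)
{
	return bus - m->bus_offset + m->virt_offset;
}
```

In this model a PReP kernel pointer 0xC0001000 corresponds to bus address 0x80001000, which is the value a bus master would have to be given to reach the same byte.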
Anyway, the way to look up all these translations, you do #include phys_addr = virt_to_phys(virt_addr); virt_addr = phys_to_virt(phys_addr); bus_addr = virt_to_bus(virt_addr); virt_addr = bus_to_virt(bus_addr); Now, when do you need these? You want the _virtual_ address when you are actually going to access that pointer from the kernel. So you can have something like this: /* * this is the hardware "mailbox" we use to communicate with * the controller. The controller sees this directly. */ struct mailbox { __u32 status; __u32 bufstart; __u32 buflen; .. } mbox; unsigned char * retbuffer; /* get the address from the controller */ retbuffer = bus_to_virt(mbox.bufstart); switch (retbuffer[0]) { case STATUS_OK: ... on the other hand, you want the bus address when you have a buffer that you want to give to the controller: /* ask the controller to read the sense status into "sense_buffer" */ mbox.bufstart = virt_to_bus(&sense_buffer); mbox.buflen = sizeof(sense_buffer); mbox.status = 0; notify_controller(&mbox); And you generally _never_ want to use the physical address, because you can't use that from the CPU (the CPU only uses translated virtual addresses), and you can't use it from the bus master. So why do we care about the physical address at all? We do need the physical address in some cases, it's just not very often in normal code. The physical address is needed if you use memory mappings, for example, because the "remap_pfn_range()" mm function wants the physical address of the memory to be remapped as measured in units of pages, a.k.a. the pfn (the memory management layer doesn't know about devices outside the CPU, so it shouldn't need to know about "bus addresses" etc). NOTE NOTE NOTE! The above is only one part of the whole equation. The above only talks about "real memory", that is, CPU memory (RAM). There is a completely different type of memory too, and that's the "shared memory" on the PCI or ISA bus. 
That's generally not RAM (although in the case of a video graphics card it can be normal DRAM that is just used for a frame buffer), but can be things like a packet buffer in a network card etc.

This memory is called "PCI memory" or "shared memory" or "IO memory" or whatever, and there is only one way to access it: the readb/writeb and related functions. You should never take the address of such memory, because there is really nothing you can do with such an address: it's not conceptually in the same memory space as "real memory" at all, so you cannot just dereference a pointer. (Sadly, on x86 it _is_ in the same memory space, so on x86 it actually works to just dereference a pointer, but it's not portable).

For such memory, you can do things like

 - reading:
	/*
	 * read first 32 bits from ISA memory at 0xC0000, aka
	 * C000:0000 in DOS terms
	 */
	unsigned int signature = isa_readl(0xC0000);

 - remapping and writing:
	/*
	 * remap framebuffer PCI memory area at 0xFC000000,
	 * size 1MB, so that we can access it: We can directly
	 * access only the 640k-1MB area, so anything else
	 * has to be remapped.
	 */
	void __iomem *baseptr = ioremap(0xFC000000, 1024*1024);

	/* write an 'A' to the offset 10 of the area */
	writeb('A',baseptr+10);

	/* unmap when we unload the driver */
	iounmap(baseptr);

 - copying and clearing:
	/* get the 6-byte Ethernet address at ISA address E000:0040 */
	memcpy_fromio(kernel_buffer, 0xE0040, 6);
	/* write a packet to the driver */
	memcpy_toio(0xE1000, skb->data, skb->len);
	/* clear the frame buffer */
	memset_io(0xA0000, 0, 0x10000);

OK, that just about covers the basics of accessing IO portably. Questions? Comments? You may think that all the above is overly complex, but one day you might find yourself with a 500 MHz Alpha in front of you, and then you'll be happy that your driver works ;)

Note that kernel versions 2.0.x (and earlier) mistakenly called the ioremap() function "vremap()".
ioremap() is the proper name, but I didn't think straight when I wrote it originally. People who have to support both can do something like: /* support old naming silliness */ #if LINUX_VERSION_CODE < 0x020100 #define ioremap vremap #define iounmap vfree #endif at the top of their source files, and then they can use the right names even on 2.0.x systems. And the above sounds worse than it really is. Most real drivers really don't do all that complex things (or rather: the complexity is not so much in the actual IO accesses as in error handling and timeouts etc). It's generally not hard to fix drivers, and in many cases the code actually looks better afterwards: unsigned long signature = *(unsigned int *) 0xC0000; vs unsigned long signature = readl(0xC0000); I think the second version actually is more readable, no? Linus Cache and TLB Flushing Under Linux David S. Miller This document describes the cache/tlb flushing interfaces called by the Linux VM subsystem. It enumerates over each interface, describes its intended purpose, and what side effect is expected after the interface is invoked. The side effects described below are stated for a uniprocessor implementation, and what is to happen on that single processor. The SMP cases are a simple extension, in that you just extend the definition such that the side effect for a particular interface occurs on all processors in the system. Don't let this scare you into thinking SMP cache/tlb flushing must be so inefficient, this is in fact an area where many optimizations are possible. For example, if it can be proven that a user address space has never executed on a cpu (see mm_cpumask()), one need not perform a flush for this address space on that cpu. First, the TLB flushing interfaces, since they are the simplest. The "TLB" is abstracted under Linux as something the cpu uses to cache virtual-->physical address translations obtained from the software page tables. 
Meaning that if the software page tables change, it is possible for stale translations to exist in this "TLB" cache. Therefore when software page table changes occur, the kernel will invoke one of the following flush methods _after_ the page table changes occur:

1) void flush_tlb_all(void)

	The most severe flush of all. After this interface runs, any previous page table modification whatsoever will be visible to the cpu.

	This is usually invoked when the kernel page tables are changed, since such translations are "global" in nature.

2) void flush_tlb_mm(struct mm_struct *mm)

	This interface flushes an entire user address space from the TLB. After running, this interface must make sure that any previous page table modifications for the address space 'mm' will be visible to the cpu. That is, after running, there will be no entries in the TLB for 'mm'.

	This interface is used to handle whole address space page table operations such as what happens during fork and exec.

3) void flush_tlb_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)

	Here we are flushing a specific range of (user) virtual address translations from the TLB. After running, this interface must make sure that any previous page table modifications for the address space 'vma->vm_mm' in the range 'start' to 'end-1' will be visible to the cpu. That is, after running, there will be no entries in the TLB for 'vma->vm_mm' for virtual addresses in the range 'start' to 'end-1'.

	The "vma" is the backing store being used for the region. Primarily, this is used for munmap() type operations.

	The interface is provided in hopes that the port can find a suitably efficient method for removing multiple page sized translations from the TLB, instead of having the kernel call flush_tlb_page (see below) for each entry which may be modified.

4) void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)

	This time we need to remove the PAGE_SIZE sized translation from the TLB.
The 'vma' is the backing structure used by Linux to keep track of mmap'd regions for a process, the address space is available via vma->vm_mm. Also, one may test (vma->vm_flags & VM_EXEC) to see if this region is executable (and thus could be in the 'instruction TLB' in split-tlb type setups). After running, this interface must make sure that any previous page table modification for address space 'vma->vm_mm' for user virtual address 'addr' will be visible to the cpu. That is, after running, there will be no entries in the TLB for 'vma->vm_mm' for virtual address 'addr'. This is used primarily during fault processing. 5) void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) At the end of every page fault, this routine is invoked to tell the architecture specific code that a translation now exists at virtual address "address" for address space "vma->vm_mm", in the software page tables. A port may use this information in any way it so chooses. For example, it could use this event to pre-load TLB translations for software managed TLB configurations. The sparc64 port currently does this. 6) void tlb_migrate_finish(struct mm_struct *mm) This interface is called at the end of an explicit process migration. This interface provides a hook to allow a platform to update TLB or context-specific information for the address space. The ia64 sn2 platform is one example of a platform that uses this interface. Next, we have the cache flushing interfaces. 
In general, when Linux is changing an existing virtual-->physical mapping to a new value, the sequence will be in one of the following forms: 1) flush_cache_mm(mm); change_all_page_tables_of(mm); flush_tlb_mm(mm); 2) flush_cache_range(vma, start, end); change_range_of_page_tables(mm, start, end); flush_tlb_range(vma, start, end); 3) flush_cache_page(vma, addr, pfn); set_pte(pte_pointer, new_pte_val); flush_tlb_page(vma, addr); The cache level flush will always be first, because this allows us to properly handle systems whose caches are strict and require a virtual-->physical translation to exist for a virtual address when that virtual address is flushed from the cache. The HyperSparc cpu is one such cpu with this attribute. The cache flushing routines below need only deal with cache flushing to the extent that it is necessary for a particular cpu. Mostly, these routines must be implemented for cpus which have virtually indexed caches which must be flushed when virtual-->physical translations are changed or removed. So, for example, the physically indexed physically tagged caches of IA32 processors have no need to implement these interfaces since the caches are fully synchronized and have no dependency on translation information. Here are the routines, one by one: 1) void flush_cache_mm(struct mm_struct *mm) This interface flushes an entire user address space from the caches. That is, after running, there will be no cache lines associated with 'mm'. This interface is used to handle whole address space page table operations such as what happens during exit and exec. 2) void flush_cache_dup_mm(struct mm_struct *mm) This interface flushes an entire user address space from the caches. That is, after running, there will be no cache lines associated with 'mm'. This interface is used to handle whole address space page table operations such as what happens during fork. This option is separate from flush_cache_mm to allow some optimizations for VIPT caches. 
3) void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) Here we are flushing a specific range of (user) virtual addresses from the cache. After running, there will be no entries in the cache for 'vma->vm_mm' for virtual addresses in the range 'start' to 'end-1'. The "vma" is the backing store being used for the region. Primarily, this is used for munmap() type operations. The interface is provided in hopes that the port can find a suitably efficient method for removing multiple page sized regions from the cache, instead of having the kernel call flush_cache_page (see below) for each entry which may be modified. 4) void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) This time we need to remove a PAGE_SIZE sized range from the cache. The 'vma' is the backing structure used by Linux to keep track of mmap'd regions for a process, the address space is available via vma->vm_mm. Also, one may test (vma->vm_flags & VM_EXEC) to see if this region is executable (and thus could be in the 'instruction cache' in "Harvard" type cache layouts). The 'pfn' indicates the physical page frame (shift this value left by PAGE_SHIFT to get the physical address) that 'addr' translates to. It is this mapping which should be removed from the cache. After running, there will be no entries in the cache for 'vma->vm_mm' for virtual address 'addr' which translates to 'pfn'. This is used primarily during fault processing. 5) void flush_cache_kmaps(void) This routine need only be implemented if the platform utilizes highmem. It will be called right before all of the kmaps are invalidated. After running, there will be no entries in the cache for the kernel virtual address range PKMAP_ADDR(0) to PKMAP_ADDR(LAST_PKMAP). 
This routine should be implemented in asm/highmem.h.

6) void flush_cache_vmap(unsigned long start, unsigned long end)
   void flush_cache_vunmap(unsigned long start, unsigned long end)

	Here in these two interfaces we are flushing a specific range of (kernel) virtual addresses from the cache. After running, there will be no entries in the cache for the kernel address space for virtual addresses in the range 'start' to 'end-1'.

	The first of these two routines is invoked after map_vm_area() has installed the page table entries. The second is invoked before unmap_kernel_range() deletes the page table entries.

There exists another whole class of cpu cache issues which currently require a whole different set of interfaces to handle properly. The biggest problem is that of virtual aliasing in the data cache of a processor.

Is your port susceptible to virtual aliasing in its D-cache? Well, if your D-cache is virtually indexed, is larger in size than PAGE_SIZE, and does not prevent multiple cache lines for the same physical address from existing at once, you have this problem.

If your D-cache has this problem, first define asm/shmparam.h SHMLBA properly; it should essentially be the size of your virtually addressed D-cache (or if the size is variable, the largest possible size). This setting will force the SysV IPC layer to only allow user processes to mmap shared memory at addresses which are a multiple of this value.

NOTE: This does not fix shared mmaps, check out the sparc64 port for one way to solve this (in particular SPARC_FLAG_MMAPSHARED).

Next, you have to solve the D-cache aliasing issue for all other cases. Please keep in mind the fact that, for a given page mapped into some user address space, there is always at least one more mapping, that of the kernel in its linear mapping starting at PAGE_OFFSET.
So immediately, once the first user maps a given physical page into its address space, by implication the D-cache aliasing problem has the potential to exist since the kernel already maps this page at its virtual address. void copy_user_page(void *to, void *from, unsigned long addr, struct page *page) void clear_user_page(void *to, unsigned long addr, struct page *page) These two routines store data in user anonymous or COW pages. It allows a port to efficiently avoid D-cache alias issues between userspace and the kernel. For example, a port may temporarily map 'from' and 'to' to kernel virtual addresses during the copy. The virtual address for these two pages is chosen in such a way that the kernel load/store instructions happen to virtual addresses which are of the same "color" as the user mapping of the page. Sparc64 for example, uses this technique. The 'addr' parameter tells the virtual address where the user will ultimately have this page mapped, and the 'page' parameter gives a pointer to the struct page of the target. If D-cache aliasing is not an issue, these two routines may simply call memcpy/memset directly and do nothing more. void flush_dcache_page(struct page *page) Any time the kernel writes to a page cache page, _OR_ the kernel is about to read from a page cache page and user space shared/writable mappings of this page potentially exist, this routine is called. NOTE: This routine need only be called for page cache pages which can potentially ever be mapped into the address space of a user process. So for example, VFS layer code handling vfs symlinks in the page cache need not call this interface at all. The phrase "kernel writes to a page cache page" means, specifically, that the kernel executes store instructions that dirty data in that page at the page->virtual mapping of that page. It is important to flush here to handle D-cache aliasing, to make sure these kernel stores are visible to user space mappings of that page. 
The corollary case is just as important, if there are users which have shared+writable mappings of this file, we must make sure that kernel reads of these pages will see the most recent stores done by the user. If D-cache aliasing is not an issue, this routine may simply be defined as a nop on that architecture. There is a bit set aside in page->flags (PG_arch_1) as "architecture private". The kernel guarantees that, for pagecache pages, it will clear this bit when such a page first enters the pagecache. This allows these interfaces to be implemented much more efficiently. It allows one to "defer" (perhaps indefinitely) the actual flush if there are currently no user processes mapping this page. See sparc64's flush_dcache_page and update_mmu_cache implementations for an example of how to go about doing this. The idea is, first at flush_dcache_page() time, if page->mapping->i_mmap is an empty tree and ->i_mmap_nonlinear an empty list, just mark the architecture private page flag bit. Later, in update_mmu_cache(), a check is made of this flag bit, and if set the flush is done and the flag bit is cleared. IMPORTANT NOTE: It is often important, if you defer the flush, that the actual flush occurs on the same CPU as did the cpu stores into the page to make it dirty. Again, see sparc64 for examples of how to deal with this. void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long user_vaddr, void *dst, void *src, int len) void copy_from_user_page(struct vm_area_struct *vma, struct page *page, unsigned long user_vaddr, void *dst, void *src, int len) When the kernel needs to copy arbitrary data in and out of arbitrary user pages (f.e. for ptrace()) it will use these two routines. Any necessary cache flushing or other coherency operations that need to occur should happen here. If the processor's instruction cache does not snoop cpu stores, it is very likely that you will need to flush the instruction cache for copy_to_user_page(). 
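The "color" notion used above for SHMLBA and for copy_user_page()/clear_user_page() can be made concrete with a small userspace sketch. It is an illustration only, not kernel code: EX_PAGE_SHIFT, EX_SHMLBA, cache_color() and may_alias() are hypothetical names, and the 4 KiB page and 16 KiB virtually indexed D-cache are assumed values. Two virtual mappings of the same physical page can occupy different cache lines only when their colors differ, which is why forcing shared mappings onto SHMLBA multiples avoids the alias.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed example geometry; real values are per-architecture. */
#define EX_PAGE_SHIFT	12u		/* 4 KiB pages (assumption) */
#define EX_SHMLBA	(16u * 1024u)	/* size of the virtually indexed D-cache (assumption) */

/* The masking below only works for a power-of-2 cache size. */
static_assert((EX_SHMLBA & (EX_SHMLBA - 1)) == 0, "EX_SHMLBA must be a power of 2");

/* Color = the virtual index bits above the page offset. */
static unsigned int cache_color(uintptr_t vaddr)
{
	return (unsigned int)((vaddr & (EX_SHMLBA - 1)) >> EX_PAGE_SHIFT);
}

/*
 * Two virtual addresses mapping the same physical page may alias in
 * the D-cache only if their colors differ.
 */
static bool may_alias(uintptr_t va1, uintptr_t va2)
{
	return cache_color(va1) != cache_color(va2);
}
```

With these assumed sizes there are four colors, so a kernel address and a user address for the same page are alias-free whenever they differ by a multiple of 16 KiB.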
  void flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned long vmaddr)

	When the kernel needs to access the contents of an anonymous page, it calls this function (currently only get_user_pages()). Note: flush_dcache_page() deliberately doesn't work for an anonymous page. The default implementation is a nop (and should remain so for all coherent architectures). For incoherent architectures, it should flush the cache of the page at vmaddr.

  void flush_kernel_dcache_page(struct page *page)

	When the kernel needs to modify a user page it has obtained with kmap, it calls this function after all modifications are complete (but before kunmapping it) to bring the underlying page up to date. It is assumed here that the user has no incoherent cached copies (i.e. the original page was obtained from a mechanism like get_user_pages()). The default implementation is a nop and should remain so on all coherent architectures. On incoherent architectures, this should flush the kernel cache for page (using page_address(page)).

  void flush_icache_range(unsigned long start, unsigned long end)

	When the kernel stores into addresses that it will execute out of (e.g. when loading modules), this function is called.

	If the icache does not snoop stores then this routine will need to flush it.

  void flush_icache_page(struct vm_area_struct *vma, struct page *page)

	All the functionality of flush_icache_page can be implemented in flush_dcache_page and update_mmu_cache. In 2.7 the hope is to remove this interface completely.

The final category of APIs is for I/O to deliberately aliased address ranges inside the kernel. Such aliases are set up by use of the vmap/vmalloc API. Since kernel I/O goes via physical pages, the I/O subsystem assumes that the user mapping and kernel offset mapping are the only aliases. This isn't true for vmap aliases, so anything in the kernel trying to do I/O to vmap areas must manually manage coherency.
It must do this by flushing the vmap range before doing I/O and invalidating it after the I/O returns. void flush_kernel_vmap_range(void *vaddr, int size) flushes the kernel cache for a given virtual address range in the vmap area. This is to make sure that any data the kernel modified in the vmap range is made visible to the physical page. The design is to make this area safe to perform I/O on. Note that this API does *not* also flush the offset map alias of the area. void invalidate_kernel_vmap_range(void *vaddr, int size) invalidates the cache for a given virtual address range in the vmap area which prevents the processor from making the cache stale by speculatively reading data while the I/O was occurring to the physical pages. This is only necessary for data reads into the vmap area. ================ CIRCULAR BUFFERS ================ By: David Howells Paul E. McKenney Linux provides a number of features that can be used to implement circular buffering. There are two sets of such features: (1) Convenience functions for determining information about power-of-2 sized buffers. (2) Memory barriers for when the producer and the consumer of objects in the buffer don't want to share a lock. To use these facilities, as discussed below, there needs to be just one producer and just one consumer. It is possible to handle multiple producers by serialising them, and to handle multiple consumers by serialising them. Contents: (*) What is a circular buffer? (*) Measuring power-of-2 buffers. (*) Using memory barriers with circular buffers. - The producer. - The consumer. ========================== WHAT IS A CIRCULAR BUFFER? ========================== First of all, what is a circular buffer? A circular buffer is a buffer of fixed, finite size into which there are two indices: (1) A 'head' index - the point at which the producer inserts items into the buffer. (2) A 'tail' index - the point at which the consumer finds the next item in the buffer. 
Typically when the tail pointer is equal to the head pointer, the buffer is empty; and the buffer is full when the head pointer is one less than the tail pointer.

The head index is incremented when items are added, and the tail index when items are removed. The tail index should never jump the head index, and both indices should be wrapped to 0 when they reach the end of the buffer, thus allowing an infinite amount of data to flow through the buffer.

Typically, items will all be of the same unit size, but this isn't strictly required to use the techniques below. The indices can be increased by more than 1 if multiple items or variable-sized items are to be included in the buffer, provided that neither index overtakes the other. The implementer must be careful, however, as a region more than one unit in size may wrap the end of the buffer and be broken into two segments.

============================
MEASURING POWER-OF-2 BUFFERS
============================

Calculation of the occupancy or the remaining capacity of an arbitrarily sized circular buffer would normally be a slow operation, requiring the use of a modulus (divide) instruction. However, if the buffer is of a power-of-2 size, then a much quicker bitwise-AND instruction can be used instead.

Linux provides a set of macros for handling power-of-2 circular buffers. These can be made use of by:

	#include <linux/circ_buf.h>

The macros are:

 (*) Measure the remaining capacity of a buffer:

	CIRC_SPACE(head_index, tail_index, buffer_size);

     This returns the amount of space left in the buffer[1] into which items can be inserted.

 (*) Measure the maximum consecutive immediate space in a buffer:

	CIRC_SPACE_TO_END(head_index, tail_index, buffer_size);

     This returns the amount of consecutive space left in the buffer[1] into which items can be immediately inserted without having to wrap back to the beginning of the buffer.
(*) Measure the occupancy of a buffer: CIRC_CNT(head_index, tail_index, buffer_size); This returns the number of items currently occupying a buffer[2]. (*) Measure the non-wrapping occupancy of a buffer: CIRC_CNT_TO_END(head_index, tail_index, buffer_size); This returns the number of consecutive items[2] that can be extracted from the buffer without having to wrap back to the beginning of the buffer. Each of these macros will nominally return a value between 0 and buffer_size-1, however: [1] CIRC_SPACE*() are intended to be used in the producer. To the producer they will return a lower bound as the producer controls the head index, but the consumer may still be depleting the buffer on another CPU and moving the tail index. To the consumer it will show an upper bound as the producer may be busy depleting the space. [2] CIRC_CNT*() are intended to be used in the consumer. To the consumer they will return a lower bound as the consumer controls the tail index, but the producer may still be filling the buffer on another CPU and moving the head index. To the producer it will show an upper bound as the consumer may be busy emptying the buffer. [3] To a third party, the order in which the writes to the indices by the producer and consumer become visible cannot be guaranteed as they are independent and may be made on different CPUs - so the result in such a situation will merely be a guess, and may even be negative. =========================================== USING MEMORY BARRIERS WITH CIRCULAR BUFFERS =========================================== By using memory barriers in conjunction with circular buffers, you can avoid the need to: (1) use a single lock to govern access to both ends of the buffer, thus allowing the buffer to be filled and emptied at the same time; and (2) use atomic counter operations. There are two sides to this: the producer that fills the buffer, and the consumer that empties it. 
Only one thing should be filling a buffer at any one time, and only one thing should be emptying a buffer at any one time, but the two sides can operate simultaneously.


THE PRODUCER
------------

The producer will look something like this:

	spin_lock(&producer_lock);

	unsigned long head = buffer->head;
	unsigned long tail = ACCESS_ONCE(buffer->tail);

	if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
		/* insert one item into the buffer */
		struct item *item = buffer[head];

		produce_item(item);

		smp_wmb(); /* commit the item before incrementing the head */

		buffer->head = (head + 1) & (buffer->size - 1);

		/* wake_up() will make sure that the head is committed before
		 * waking anyone up */
		wake_up(consumer);
	}

	spin_unlock(&producer_lock);

This will instruct the CPU that the contents of the new item must be written before the head index makes it available to the consumer and then instructs the CPU that the revised head index must be written before the consumer is woken.

Note that wake_up() doesn't have to be the exact mechanism used, but whatever is used must guarantee a (write) memory barrier between the update of the head index and the change of state of the consumer, if a change of state occurs.


THE CONSUMER
------------

The consumer will look something like this:

	spin_lock(&consumer_lock);

	unsigned long head = ACCESS_ONCE(buffer->head);
	unsigned long tail = buffer->tail;

	if (CIRC_CNT(head, tail, buffer->size) >= 1) {
		/* read index before reading contents at that index */
		smp_read_barrier_depends();

		/* extract one item from the buffer */
		struct item *item = buffer[tail];

		consume_item(item);

		smp_mb(); /* finish reading descriptor before incrementing tail */

		buffer->tail = (tail + 1) & (buffer->size - 1);
	}

	spin_unlock(&consumer_lock);

This will instruct the CPU to make sure the index is up to date before reading the new item, and then it shall make sure the CPU has finished reading the item before it writes the new tail pointer, which will erase the item.
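For experimenting outside the kernel, the producer and consumer above can be approximated in userspace with C11 atomics. This is a sketch under stated assumptions, not the kernel implementation. The CIRC_CNT() and CIRC_SPACE() definitions repeat the arithmetic of the kernel macros. The names struct ring, RING_SIZE, ring_produce() and ring_consume() are hypothetical. The acquire/release orderings stand in for ACCESS_ONCE() and the smp_*() barriers as a userspace analogue rather than an exact equivalent. One producer thread and one consumer thread are assumed, so the spinlocks are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Same arithmetic as the kernel macros; valid only for power-of-2 sizes. */
#define CIRC_CNT(head, tail, size)	(((head) - (tail)) & ((size) - 1))
#define CIRC_SPACE(head, tail, size)	CIRC_CNT((tail), ((head) + 1), (size))

#define RING_SIZE 8ul			/* hypothetical size */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be a power of 2");

struct ring {
	_Atomic unsigned long head;	/* written only by the producer */
	_Atomic unsigned long tail;	/* written only by the consumer */
	int items[RING_SIZE];
};

/* Producer side: returns false if the ring is full. */
static bool ring_produce(struct ring *r, int value)
{
	unsigned long head = atomic_load_explicit(&r->head, memory_order_relaxed);
	/* acquire: the consumer has finished reading any slot it released */
	unsigned long tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (CIRC_SPACE(head, tail, RING_SIZE) < 1)
		return false;

	r->items[head] = value;
	/* release: commit the item before publishing the new head */
	atomic_store_explicit(&r->head, (head + 1) & (RING_SIZE - 1),
			      memory_order_release);
	return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_consume(struct ring *r, int *value)
{
	/* acquire: read the index before reading the contents at that index */
	unsigned long head = atomic_load_explicit(&r->head, memory_order_acquire);
	unsigned long tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

	if (CIRC_CNT(head, tail, RING_SIZE) < 1)
		return false;

	*value = r->items[tail];
	/* release: finish reading the item before handing the slot back */
	atomic_store_explicit(&r->tail, (tail + 1) & (RING_SIZE - 1),
			      memory_order_release);
	return true;
}
```

Because one slot is always left unused to distinguish full from empty, a ring of RING_SIZE slots holds at most RING_SIZE - 1 items.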
Note the use of ACCESS_ONCE() in both algorithms to read the opposition index. This prevents the compiler from discarding and reloading its cached value - which some compilers will do across smp_read_barrier_depends(). This isn't strictly needed if you can be sure that the opposition index will _only_ be used the once. =============== FURTHER READING =============== See also Documentation/memory-barriers.txt for a description of Linux's memory barrier facilities. Copyright 2010 Nicolas Palix Copyright 2010 Julia Lawall Copyright 2010 Gilles Muller Getting Coccinelle ~~~~~~~~~~~~~~~~~~~~ The semantic patches included in the kernel use the 'virtual rule' feature which was introduced in Coccinelle version 0.1.11. Coccinelle (>=0.2.0) is available through the package manager of many distributions, e.g. : - Debian (>=squeeze) - Fedora (>=13) - Ubuntu (>=10.04 Lucid Lynx) - OpenSUSE - Arch Linux - NetBSD - FreeBSD You can get the latest version released from the Coccinelle homepage at http://coccinelle.lip6.fr/ Information and tips about Coccinelle are also provided on the wiki pages at http://cocci.ekstranet.diku.dk/wiki/doku.php Once you have it, run the following command: ./configure make as a regular user, and install it with sudo make install The semantic patches in the kernel will work best with Coccinelle version 0.2.4 or later. Using earlier versions may incur some parse errors in the semantic patch code, but any results that are obtained should still be correct. Using Coccinelle on the Linux kernel ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A Coccinelle-specific target is defined in the top level Makefile. This target is named 'coccicheck' and calls the 'coccicheck' front-end in the 'scripts' directory. Four modes are defined: patch, report, context, and org. The mode to use is specified by setting the MODE variable with 'MODE='. 'patch' proposes a fix, when possible. 
'report' generates a list in the following format:

	file:line:column-column: message

'context' highlights lines of interest and their context in a diff-like style. Lines of interest are indicated with '-'.

'org' generates a report in the Org mode format of Emacs.

Note that not all semantic patches implement all modes. For easy use of Coccinelle, the default mode is "chain" which tries the previous modes in the order above until one succeeds.

To make a report for every semantic patch, run the following command:

	make coccicheck MODE=report

To produce patches, run:

	make coccicheck MODE=patch

The coccicheck target applies every semantic patch available in the sub-directories of 'scripts/coccinelle' to the entire Linux kernel.

For each semantic patch, a commit message is proposed. It gives a description of the problem being checked by the semantic patch, and includes a reference to Coccinelle.

As with any static code analyzer, Coccinelle produces false positives. Thus, reports must be carefully checked, and patches reviewed.


 Using Coccinelle with a single semantic patch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The optional make variable COCCI can be used to check a single semantic patch. In that case, the variable must be initialized with the name of the semantic patch to apply.

For instance:

	make coccicheck COCCI= MODE=patch
or
	make coccicheck COCCI= MODE=report


 Controlling Which Files are Processed by Coccinelle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default the entire kernel source tree is checked.

To apply Coccinelle to a specific directory, M= can be used. For example, to check drivers/net/wireless/ one may write:

	make coccicheck M=drivers/net/wireless/

To apply Coccinelle on a file basis, instead of a directory basis, the following command may be used:

	make C=1 CHECK="scripts/coccicheck"

To check only newly edited code, use the value 2 for the C flag, i.e.
make C=2 CHECK="scripts/coccicheck" This runs every semantic patch in scripts/coccinelle by default. The COCCI variable may additionally be used to only apply a single semantic patch as shown in the previous section. The "chain" mode is the default. You can select another one with the MODE variable explained above. In this mode, there is no information about semantic patches displayed, and no commit message proposed. Proposing new semantic patches ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ New semantic patches can be proposed and submitted by kernel developers. For sake of clarity, they should be organized in the sub-directories of 'scripts/coccinelle/'. Detailed description of the 'report' mode ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 'report' generates a list in the following format: file:line:column-column: message Example: Running make coccicheck MODE=report COCCI=scripts/coccinelle/api/err_cast.cocci will execute the following part of the SmPL script. @r depends on !context && !patch && (org || report)@ expression x; position p; @@ ERR_PTR@p(PTR_ERR(x)) @script:python depends on report@ p << r.p; x << r.x; @@ msg="ERR_CAST can be used with %s" % (x) coccilib.report.print_report(p[0], msg) This SmPL excerpt generates entries on the standard output, as illustrated below: /home/user/linux/crypto/ctr.c:188:9-16: ERR_CAST can be used with alg /home/user/linux/crypto/authenc.c:619:9-16: ERR_CAST can be used with auth /home/user/linux/crypto/xts.c:227:9-16: ERR_CAST can be used with alg Detailed description of the 'patch' mode ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When the 'patch' mode is available, it proposes a fix for each problem identified. Example: Running make coccicheck MODE=patch COCCI=scripts/coccinelle/api/err_cast.cocci will execute the following part of the SmPL script. 
@ depends on !context && patch && !org && !report @ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) This SmPL excerpt generates patch hunks on the standard output, as illustrated below: diff -u -p a/crypto/ctr.c b/crypto/ctr.c --- a/crypto/ctr.c 2010-05-26 10:49:38.000000000 +0200 +++ b/crypto/ctr.c 2010-06-03 23:44:49.000000000 +0200 @@ -185,7 +185,7 @@ static struct crypto_instance *crypto_ct alg = crypto_attr_alg(tb[1], CRYPTO_ALG_TYPE_CIPHER, CRYPTO_ALG_TYPE_MASK); if (IS_ERR(alg)) - return ERR_PTR(PTR_ERR(alg)); + return ERR_CAST(alg); /* Block size must be >= 4 bytes. */ err = -EINVAL; Detailed description of the 'context' mode ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 'context' highlights lines of interest and their context in a diff-like style. NOTE: The diff-like output generated is NOT an applicable patch. The intent of the 'context' mode is to highlight the important lines (annotated with minus, '-') and gives some surrounding context lines around. This output can be used with the diff mode of Emacs to review the code. Example: Running make coccicheck MODE=context COCCI=scripts/coccinelle/api/err_cast.cocci will execute the following part of the SmPL script. @ depends on context && !patch && !org && !report@ expression x; @@ * ERR_PTR(PTR_ERR(x)) This SmPL excerpt generates diff hunks on the standard output, as illustrated below: diff -u -p /home/user/linux/crypto/ctr.c /tmp/nothing --- /home/user/linux/crypto/ctr.c 2010-05-26 10:49:38.000000000 +0200 +++ /tmp/nothing @@ -185,7 +185,6 @@ static struct crypto_instance *crypto_ct alg = crypto_attr_alg(tb[1], CRYPTO_ALG_TYPE_CIPHER, CRYPTO_ALG_TYPE_MASK); if (IS_ERR(alg)) - return ERR_PTR(PTR_ERR(alg)); /* Block size must be >= 4 bytes. */ err = -EINVAL; Detailed description of the 'org' mode ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 'org' generates a report in the Org mode format of Emacs. 
Example:

Running

	make coccicheck MODE=org COCCI=scripts/coccinelle/api/err_cast.cocci

will execute the following part of the SmPL script.

@r depends on !context && !patch && (org || report)@
expression x;
position p;
@@

 ERR_PTR@p(PTR_ERR(x))

@script:python depends on org@
p << r.p;
x << r.x;
@@

msg="ERR_CAST can be used with %s" % (x)
msg_safe=msg.replace("[","@(").replace("]",")")
coccilib.org.print_todo(p[0], msg_safe)

This SmPL excerpt generates Org entries on the standard output, as illustrated below:

* TODO [[view:/home/user/linux/crypto/ctr.c::face=ovl-face1::linb=188::colb=9::cole=16][ERR_CAST can be used with alg]]
* TODO [[view:/home/user/linux/crypto/authenc.c::face=ovl-face1::linb=619::colb=9::cole=16][ERR_CAST can be used with auth]]
* TODO [[view:/home/user/linux/crypto/xts.c::face=ovl-face1::linb=227::colb=9::cole=16][ERR_CAST can be used with alg]]


		CPU hotplug Support in Linux(tm) Kernel

		Maintainers:
		CPU Hotplug Core:
			Rusty Russell
			Srivatsa Vaddagiri
		i386:
			Zwane Mwaikambo
		ppc64:
			Nathan Lynch
			Joel Schopp
		ia64/x86_64:
			Ashok Raj
		s390:
			Heiko Carstens

Authors: Ashok Raj
Lots of feedback: Nathan Lynch, Joel Schopp

Introduction

Modern advances in system architectures have introduced advanced error reporting and correction capabilities in processors. CPU architectures permit partitioning support, where compute resources of a single CPU could be made available to virtual machine environments. There are a couple of OEMs that support NUMA hardware which is hot-pluggable as well, where physical node insertion and removal require support for CPU hotplug.

Such advances require CPUs available to a kernel to be removed either for provisioning reasons, or for RAS purposes to keep an offending CPU off system execution path. Hence the need for CPU hotplug support in the Linux kernel.

A more novel use of CPU-hotplug support is its use today in suspend resume support for SMP. Dual-core and HT support makes even a laptop run SMP kernels which didn't support these methods.
SMP support for suspend/resume is a work in progress. General Stuff about CPU Hotplug -------------------------------- Command Line Switches --------------------- maxcpus=n Restrict boot time cpus to n. Say if you have 4 cpus, using maxcpus=2 will only boot 2. You can choose to bring the other cpus later online, read FAQ's for more info. additional_cpus=n (*) Use this to limit hotpluggable cpus. This option sets cpu_possible_map = cpu_present_map + additional_cpus cede_offline={"off","on"} Use this option to disable/enable putting offlined processors to an extended H_CEDE state on supported pseries platforms. If nothing is specified, cede_offline is set to "on". (*) Option valid only for following architectures - ia64 ia64 uses the number of disabled local apics in ACPI tables MADT to determine the number of potentially hot-pluggable cpus. The implementation should only rely on this to count the # of cpus, but *MUST* not rely on the apicid values in those tables for disabled apics. In the event BIOS doesn't mark such hot-pluggable cpus as disabled entries, one could use this parameter "additional_cpus=x" to represent those cpus in the cpu_possible_map. possible_cpus=n [s390,x86_64] use this to set hotpluggable cpus. This option sets possible_cpus bits in cpu_possible_map. Thus keeping the numbers of bits set constant even if the machine gets rebooted. CPU maps and such ----------------- [More on cpumaps and primitive to manipulate, please check include/linux/cpumask.h that has more descriptive text.] cpu_possible_map: Bitmap of possible CPUs that can ever be available in the system. This is used to allocate some boot time memory for per_cpu variables that aren't designed to grow/shrink as CPUs are made available or removed. Once set during boot time discovery phase, the map is static, i.e no bits are added or removed anytime. Trimming it accurately for your system needs upfront can save some boot time memory. 
See below for how we use heuristics in x86_64 case to keep this under check.

cpu_online_map: Bitmap of all CPUs currently online. It is set in __cpu_up() after a cpu is available for kernel scheduling and ready to receive interrupts from devices. It is cleared when a cpu is brought down using __cpu_disable(), before which all OS services including interrupts are migrated to another target CPU.

cpu_present_map: Bitmap of CPUs currently present in the system. Not all of them may be online. When physical hotplug is processed by the relevant subsystem (e.g. ACPI), this map can change: a bit is added to or removed from the map depending on whether the event is a hot-add or a hot-remove. There are currently no locking rules. Typical usage is to init topology during boot, at which time hotplug is disabled.

You really don't need to manipulate any of the system cpu maps. They should be read-only for most use. When setting up per-cpu resources almost always use cpu_possible_map/for_each_possible_cpu() to iterate.

Never use anything other than cpumask_t to represent a bitmap of CPUs.

	#include <linux/cpumask.h>

	for_each_possible_cpu     - Iterate over cpu_possible_map
	for_each_online_cpu       - Iterate over cpu_online_map
	for_each_present_cpu      - Iterate over cpu_present_map
	for_each_cpu_mask(x,mask) - Iterate over some random collection of cpu mask.

	#include <linux/cpu.h>
	get_online_cpus() and put_online_cpus():

The above calls are used to inhibit cpu hotplug operations. While the cpu_hotplug.refcount is non zero, the cpu_online_map will not change. If you merely need to avoid cpus going away, you could also use preempt_disable() and preempt_enable() for those sections. Just remember the critical section cannot call any function that can sleep or schedule this process away. The preempt_disable() will work as long as stop_machine_run() is used to take a cpu down.

CPU Hotplug - Frequently Asked Questions.

Q: How to enable my kernel to support CPU hotplug?
A: When doing make defconfig, enable CPU hotplug support

   "Processor type and Features" -> Support for Hotpluggable CPUs

Make sure that you have CONFIG_HOTPLUG and CONFIG_SMP turned on as well.

You would need to enable CONFIG_HOTPLUG_CPU for SMP suspend/resume support as well.

Q: What architectures support CPU hotplug?
A: As of 2.6.14, the following architectures support CPU hotplug.

i386 (Intel), ppc, ppc64, parisc, s390, ia64 and x86_64

Q: How to test if hotplug is supported on the newly built kernel?
A: You should now notice an entry in sysfs.

Check if sysfs is mounted, using the "mount" command. You should notice an entry as shown below in the output.

	....
	none on /sys type sysfs (rw)
	....

If this is not mounted, do the following.

	 #mkdir /sys
	 #mount -t sysfs sys /sys

Now you should see entries for all present cpus; the following is an example on an 8-way system.

	#pwd
	/sys/devices/system/cpu
	#ls -l
	total 0
	drwxr-xr-x  10 root root 0 Sep 19 07:44 .
	drwxr-xr-x  13 root root 0 Sep 19 07:45 ..
	drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu0
	drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu1
	drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu2
	drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu3
	drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu4
	drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu5
	drwxr-xr-x   3 root root 0 Sep 19 07:44 cpu6
	drwxr-xr-x   3 root root 0 Sep 19 07:48 cpu7

Under each directory you would find an "online" file which is the control file to logically online/offline a processor.

Q: Does hot-add/hot-remove refer to physical add/remove of cpus?
A: The terms hot-add/hot-remove are not used very consistently in the code. CONFIG_HOTPLUG_CPU enables logical online/offline capability in the kernel. To support physical addition/removal, one would need some BIOS hooks and the platform should have something like an attention button in PCI hotplug. CONFIG_ACPI_HOTPLUG_CPU enables ACPI support for physical add/remove of CPUs.

Q: How do I logically offline a CPU?
A: Do the following.
	#echo 0 > /sys/devices/system/cpu/cpuX/online

Once the logical offline is successful, check

	#cat /proc/interrupts

You should now not see the CPU that you removed. Also, the online file will report the state as 0 when a cpu is offline and 1 when it is online.

	#To display the current cpu state.
	#cat /sys/devices/system/cpu/cpuX/online

Q: Why can't I remove CPU0 on some systems?
A: Some architectures may have some special dependency on a certain CPU.

For example, on IA64 platforms we have the ability to send platform interrupts to the OS, a.k.a. Corrected Platform Error Interrupts (CPEI). In current ACPI specifications, we didn't have a way to change the target CPU. Hence if the current ACPI version doesn't support such re-direction, we disable that CPU by making it not-removable.

In such cases you will also notice that the online file is missing under cpu0.

Q: How do I find out if a particular CPU is not removable?
A: Depending on the implementation, some architectures may show this by the absence of the "online" file. This is done if it can be determined ahead of time that this CPU cannot be removed.

In some situations, this can be a run time check, i.e. if you try to remove the last CPU, this will not be permitted. You can find such failures by investigating the return value of the "echo" command.

Q: What happens when a CPU is being logically offlined?
A: The following happen, listed in no particular order :-)

- A notification is sent to in-kernel registered modules by sending an event CPU_DOWN_PREPARE or CPU_DOWN_PREPARE_FROZEN, depending on whether or not the CPU is being offlined while tasks are frozen due to a suspend operation in progress
- All processes are migrated away from this outgoing CPU to new CPUs. The new CPU is chosen from each process' current cpuset, which may be a subset of all online CPUs.
- All interrupts targeted to this CPU are migrated to a new CPU
- timers/bottom halves/tasklets are also migrated to a new CPU
- Once all services are migrated, the kernel calls an arch-specific routine
  __cpu_disable() to perform arch-specific cleanup.
- Once this is successful, an event for successful cleanup is sent by an event
  CPU_DEAD (or CPU_DEAD_FROZEN if tasks are frozen due to a suspend while the
  CPU is being offlined).

  "It is expected that each service cleans up when the CPU_DOWN_PREPARE
  notifier is called; when CPU_DEAD is called it is expected that there is
  nothing running on behalf of the CPU that was offlined"

Q: If I have some kernel code that needs to be aware of CPU arrival and
   departure, how do I arrange for proper notification?
A: This is what you would need in your kernel code to receive notifications.

        #include <linux/cpu.h>
        static int __cpuinit foobar_cpu_callback(struct notifier_block *nfb,
                                                 unsigned long action,
                                                 void *hcpu)
        {
                /* hcpu carries the CPU number, not a real pointer */
                unsigned int cpu = (unsigned long)hcpu;

                switch (action) {
                case CPU_ONLINE:
                case CPU_ONLINE_FROZEN:
                        foobar_online_action(cpu);
                        break;
                case CPU_DEAD:
                case CPU_DEAD_FROZEN:
                        foobar_dead_action(cpu);
                        break;
                }
                return NOTIFY_OK;
        }

        static struct notifier_block __cpuinitdata foobar_cpu_notifier =
        {
           .notifier_call = foobar_cpu_callback,
        };

You need to call register_cpu_notifier() from your init function.
Init functions could be of two types:
1. early init (init function called when only the boot processor is online).
2. late init (init function called _after_ all the CPUs are online).

For the first case, you should add the following to your init function

        register_cpu_notifier(&foobar_cpu_notifier);

For the second case, you should add the following to your init function

        register_hotcpu_notifier(&foobar_cpu_notifier);

You can fail PREPARE notifiers if something doesn't work to prepare resources.
This will stop the activity and send a following CANCELED event back.
CPU_DEAD should not be failed; it is just a goodness indication, but bad
things will happen if a notifier in the path sends a BAD notify code.

Q: I don't see my action being called for all CPUs already up and running?
A: Yes, CPU notifiers are called only when new CPUs are on-lined or offlined.
   If you need to perform some action for each cpu already in the system, then

        for_each_online_cpu(i) {
                /* the callback expects the CPU number packed into a pointer */
                foobar_cpu_callback(&foobar_cpu_notifier, CPU_UP_PREPARE,
                                    (void *)(long)i);
                foobar_cpu_callback(&foobar_cpu_notifier, CPU_ONLINE,
                                    (void *)(long)i);
        }

Q: If I would like to develop cpu hotplug support for a new architecture,
   what do I need at a minimum?
A: The following are required for the CPU hotplug infrastructure to work
   correctly.

    - Make sure you have an entry in Kconfig to enable CONFIG_HOTPLUG_CPU
    - __cpu_up()        - Arch interface to bring up a CPU
    - __cpu_disable()   - Arch interface to shut down a CPU; no more
                          interrupts can be handled by the kernel after the
                          routine returns. Local APIC timers etc. are shut
                          down as well.
    - __cpu_die()       - This is supposed to ensure the death of the CPU.
                          Look at some example code in other architectures
                          that implement CPU hotplug. The processor is taken
                          down from the idle() loop for that specific
                          architecture. __cpu_die() typically waits for some
                          per_cpu state to be set, to be positively sure that
                          the processor's dead routine has been called.

Q: I need to ensure that a particular cpu is not removed while some work
   specific to this cpu is in progress.
A: There are two ways. If your code can be run in interrupt context, use
   smp_call_function_single(), otherwise use work_on_cpu().
   Note that work_on_cpu() is slow, and can fail due to out of memory:

        int my_func_on_cpu(int cpu)
        {
                int err;
                get_online_cpus();
                if (!cpu_online(cpu))
                        err = -EINVAL;
                else
        #if NEEDS_BLOCKING
                        err = work_on_cpu(cpu, __my_func_on_cpu, NULL);
        #else
                        smp_call_function_single(cpu, __my_func_on_cpu, &err,
                                                 true);
        #endif
                put_online_cpus();
                return err;
        }

Q: How do we determine how many CPUs are available for hotplug?
A: There is no clear spec-defined way from ACPI that can give us that
   information today. Based on some input from Natalie of Unisys, the ACPI
   MADT (Multiple APIC Description Table) marks those possible CPUs in a
   system with disabled status.

   Andi implemented some simple heuristics that count the number of disabled
   CPUs in MADT as hotpluggable CPUs. In the case there are no disabled CPUs
   we assume 1/2 the number of CPUs currently present can be hotplugged.

   Caveat: Today's ACPI MADT can only provide 256 entries since the apicid
   field in MADT is only 8 bits.

User Space Notification

Hotplug support for devices is common in Linux today. It is being used today
to support automatic configuration of network, usb and pci devices. A hotplug
event can be used to invoke an agent script to perform the configuration task.

You can add /etc/hotplug/cpu.agent to handle hotplug notification user space
scripts.

        #!/bin/bash
        # $Id: cpu.agent
        # Kernel hotplug params include:
        #ACTION=%s [online or offline]
        #DEVPATH=%s
        #
        cd /etc/hotplug
        . ./hotplug.functions

        case $ACTION in
                online)
                        echo `date` ":cpu.agent" add cpu >> /tmp/hotplug.txt
                        ;;
                offline)
                        echo `date` ":cpu.agent" remove cpu >>/tmp/hotplug.txt
                        ;;
                *)
                        debug_mesg CPU $ACTION event not supported
                        exit 1
                        ;;
        esac

CPU load
--------

Linux exports various bits of information via `/proc/stat' and
`/proc/uptime' that userland tools, such as top(1), use to calculate
the average time the system spent in a particular state, for example:

    $ iostat
    Linux 2.6.18.3-exp (linmac)     02/20/2007

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              10.01    0.00    2.92    5.44    0.00   81.63

    ...

Here the system thinks that over the default sampling period the
system spent 10.01% of the time doing work in user space, 2.92% in the
kernel, and was overall 81.63% of the time idle.

In most cases the `/proc/stat' information reflects reality quite
closely. However, due to the nature of how/when the kernel collects
this data, sometimes it cannot be trusted at all.

So how is this information collected? Whenever a timer interrupt is
signalled, the kernel looks at what kind of task was running at that
moment and increments the counter that corresponds to this task's
kind/state. The problem with this is that the system could have
switched between various states multiple times between two timer
interrupts, yet the counter is incremented only for the last state.


Example
-------

If we imagine the system with one task that periodically burns cycles
in the following manner:

 time line between two timer interrupts
|--------------------------------------|
 ^                                    ^
 |_ something begins working          |
                                      |_ something goes to sleep
                                     (only to be awakened quite soon)

In the above situation the system will be 0% loaded according to
`/proc/stat' (since the timer interrupt will always happen when the
system is executing the idle handler), but in reality the load is
closer to 99%.

One can imagine many more situations where this behavior of the kernel
will lead to quite erratic information inside `/proc/stat'.
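
How a userland tool turns two `/proc/stat' samples into a percentage
can be sketched as below. This is only an illustration, not the code
of top(1) or iostat: struct cpu_sample, parse_cpu_line() and
busy_percent() are names made up for this sketch, and it assumes the
aggregate "cpu" line begins with the user, nice, system and idle tick
counters in that order, ignoring any later fields.

```c
#include <stdio.h>
#include <string.h>

/* Tick counters taken from the aggregate "cpu" line (assumed layout). */
struct cpu_sample {
	unsigned long long user, nice, system, idle;
};

/* Parse e.g. "cpu  100 0 50 850 ..."; per-CPU "cpuN" lines are rejected. */
static int parse_cpu_line(const char *line, struct cpu_sample *s)
{
	if (strncmp(line, "cpu ", 4) != 0)
		return -1;
	return sscanf(line + 4, "%llu %llu %llu %llu",
		      &s->user, &s->nice, &s->system, &s->idle) == 4 ? 0 : -1;
}

/* Percentage of ticks between samples a and b not accounted as idle. */
static double busy_percent(const struct cpu_sample *a,
			   const struct cpu_sample *b)
{
	unsigned long long busy = (b->user - a->user) + (b->nice - a->nice) +
				  (b->system - a->system);
	unsigned long long total = busy + (b->idle - a->idle);

	return total ? 100.0 * (double)busy / (double)total : 0.0;
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN for a one-second measurement */
#include <unistd.h>

static int read_sample(struct cpu_sample *s)
{
	char line[256];
	FILE *f = fopen("/proc/stat", "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = parse_cpu_line(line, s);
	fclose(f);
	return ret;
}

int main(void)
{
	struct cpu_sample a, b;

	if (read_sample(&a))
		return 1;
	sleep(1);
	if (read_sample(&b))
		return 1;
	printf("busy: %.2f%%\n", busy_percent(&a, &b));
	return 0;
}
#endif
```

Because the counters only advance at timer interrupts, a task that is
always asleep when the interrupt fires adds nothing to these deltas,
which is how the computed figure can stay near 0% on a busy system.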

/* gcc -o hog smallhog.c */
#include <time.h>
#include <limits.h>
#include <signal.h>
#include <sys/time.h>
#define HIST 10

static volatile sig_atomic_t stop;

/* SIGALRM handler: tells hog() to stop spinning */
static void sighandler (int signr)
{
     (void) signr;
     stop = 1;
}

/* Spin until the timer fires; returns the iterations left over */
static unsigned long hog (unsigned long niters)
{
     stop = 0;
     while (!stop && --niters);
     return niters;
}

int main (void)
{
     int i;
     struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
                             .it_value = { .tv_sec = 0, .tv_usec = 1 } };
     sigset_t set;
     unsigned long v[HIST];
     double tmp = 0.0;
     unsigned long n;

     signal (SIGALRM, &sighandler);
     setitimer (ITIMER_REAL, &it, NULL);

     /* Measure how many iterations fit between two timer signals */
     hog (ULONG_MAX);
     for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
     for (i = 0; i < HIST; ++i) tmp += v[i];
     tmp /= HIST;
     n = tmp - (tmp / 3.0);

     sigemptyset (&set);
     sigaddset (&set, SIGALRM);

     /* Burn cycles, then sleep until the next timer signal */
     for (;;) {
         hog (n);
         sigwait (&set, &i);
     }
     return 0;
}


References
----------

http://lkml.org/lkml/2007/2/12/6
Documentation/filesystems/proc.txt (1.8)


Thanks
------

Con Kolivas, Pavel Machek

Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo.

1) /sys/devices/system/cpu/cpuX/topology/physical_package_id:

        physical package id of cpuX. Typically corresponds to a physical
        socket number, but the actual value is architecture and platform
        dependent.

2) /sys/devices/system/cpu/cpuX/topology/core_id:

        the CPU core ID of cpuX. Typically it is the hardware platform's
        identifier (rather than the kernel's). The actual value is
        architecture and platform dependent.

3) /sys/devices/system/cpu/cpuX/topology/book_id:

        the book ID of cpuX. Typically it is the hardware platform's
        identifier (rather than the kernel's). The actual value is
        architecture and platform dependent.

4) /sys/devices/system/cpu/cpuX/topology/thread_siblings:

        internal kernel map of cpuX's hardware threads within the same
        core as cpuX

5) /sys/devices/system/cpu/cpuX/topology/core_siblings:

        internal kernel map of cpuX's hardware threads within the same
        physical_package_id.

6) /sys/devices/system/cpu/cpuX/topology/book_siblings:

        internal kernel map of cpuX's hardware threads within the same
        book_id.

To implement it in an architecture-neutral way, a new source file,
drivers/base/topology.c, exports the 4 or 6 attributes. The two book
related sysfs files will only be created if CONFIG_SCHED_BOOK is selected.

For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h:
#define topology_physical_package_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_thread_cpumask(cpu)
#define topology_core_cpumask(cpu)
#define topology_book_cpumask(cpu)

The type of **_id is int.
The type of siblings is (const) struct cpumask *.

To be consistent on all architectures, include/linux/topology.h
provides default definitions for any of the above macros that are
not defined by include/asm-XXX/topology.h:
1) physical_package_id: -1
2) core_id: 0
3) thread_siblings: just the given CPU
4) core_siblings: just the given CPU

For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
default definitions for topology_book_id() and topology_book_cpumask().

Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal
source for the output is in brackets ("[]").

    kernel_max: the maximum CPU index allowed by the kernel configuration.
                [NR_CPUS-1]

    offline:    CPUs that are not online because they have been
                HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
                of CPUs allowed by the kernel configuration (kernel_max
                above). [~cpu_online_mask + cpus >= NR_CPUS]

    online:     CPUs that are online and being scheduled [cpu_online_mask]

    possible:   CPUs that have been allocated resources and can be
                brought online if they are present. [cpu_possible_mask]

    present:    CPUs that have been identified as being present in the
                system. [cpu_present_mask]

The format for the above output is compatible with cpulist_parse()
[see ].
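
A user-space reader of these files can handle the list format with a
few lines of C. The sketch below is not the kernel's cpulist_parse();
parse_cpulist() and MAX_CPUS are hypothetical names, and it assumes
the format is a comma-separated list of decimal CPU numbers or
inclusive "lo-hi" ranges, such as "0-1,3".

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_CPUS 1024	/* arbitrary bound chosen for this sketch */

/*
 * Mark every CPU named in a list such as "0-1,3" with 1 in set[].
 * Returns how many CPUs were newly marked, or -1 on malformed input.
 */
static int parse_cpulist(const char *s, unsigned char set[MAX_CPUS])
{
	int count = 0;

	while (*s && *s != '\n') {
		char *end;
		long lo, hi, c;

		lo = hi = strtol(s, &end, 10);
		if (end == s)
			return -1;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 0 || hi < lo || hi >= MAX_CPUS)
			return -1;
		for (c = lo; c <= hi; c++) {
			if (!set[c]) {
				set[c] = 1;
				count++;
			}
		}
		if (*end == ',')
			end++;
		else if (*end && *end != '\n')
			return -1;
		s = end;
	}
	return count;
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN; pass a list as argv[1] */
int main(int argc, char **argv)
{
	static unsigned char set[MAX_CPUS];

	printf("%d\n", parse_cpulist(argc > 1 ? argv[1] : "", set));
	return 0;
}
#endif
```

An empty string is accepted and yields 0, so a file that lists no CPUs
does not count as an error in this sketch.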
Some examples follow.

In this example, there are 64 CPUs in the system but cpus 32-63 exceed
the kernel max which is limited to 0..31 by the NR_CPUS config option
being 32. Note also that CPUs 2 and 4-31 are not online but could be
brought online as they are both present and possible.

     kernel_max: 31
        offline: 2,4-31,32-63
         online: 0-1,3
       possible: 0-31
        present: 0-31

In this example, the NR_CPUS config option is 128, but the kernel was
started with possible_cpus=144. There are 4 CPUs in the system and cpu2
was manually taken offline (and is the only CPU that can be brought
online.)

     kernel_max: 127
        offline: 2,4-127,128-143
         online: 0-1,3
       possible: 0-127
        present: 0-3

See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter
as well as more information on the various cpumasks.

Overview

The Dell Systems Management Base Driver provides a sysfs interface for
systems management software such as Dell OpenManage to perform system
management interrupts and host control actions (system power cycle or
power off after OS shutdown) on certain Dell systems.

Dell OpenManage requires this driver on the following Dell PowerEdge systems:
300, 1300, 1400, 400SC, 500SC, 1500SC, 1550, 600SC, 1600SC, 650, 1655MC,
700, and 750. Other Dell software such as the open source libsmbios project
is expected to make use of this driver, and it may include the use of this
driver on other Dell systems.

The Dell libsmbios project aims towards providing access to as much BIOS
information as possible. See http://linux.dell.com/libsmbios/main/ for
more information about the libsmbios project.


System Management Interrupt

On some Dell systems, systems management software must access certain
management information via a system management interrupt (SMI). The SMI data
buffer must reside in 32-bit address space, and the physical address of the
buffer is required for the SMI. The driver maintains the memory required for
the SMI and provides a way for the application to generate the SMI.
The driver creates the following sysfs entries for systems management
software to perform these system management interrupts:

/sys/devices/platform/dcdbas/smi_data
/sys/devices/platform/dcdbas/smi_data_buf_phys_addr
/sys/devices/platform/dcdbas/smi_data_buf_size
/sys/devices/platform/dcdbas/smi_request

Systems management software must perform the following steps to execute
an SMI using this driver:

1) Lock smi_data.
2) Write system management command to smi_data.
3) Write "1" to smi_request to generate a calling interface SMI or
   "2" to generate a raw SMI.
4) Read system management command response from smi_data.
5) Unlock smi_data.


Host Control Action

Dell OpenManage supports a host control feature that allows the administrator
to perform a power cycle or power off of the system after the OS has finished
shutting down. On some Dell systems, this host control feature requires that
a driver perform an SMI after the OS has finished shutting down.

The driver creates the following sysfs entries for systems management software
to schedule the driver to perform a power cycle or power off host control
action after the system has finished shutting down:

/sys/devices/platform/dcdbas/host_control_action
/sys/devices/platform/dcdbas/host_control_smi_type
/sys/devices/platform/dcdbas/host_control_on_shutdown

Dell OpenManage performs the following steps to execute a power cycle or
power off host control action using this driver:

1) Write host control action to be performed to host_control_action.
2) Write type of SMI that driver needs to perform to host_control_smi_type.
3) Write "1" to host_control_on_shutdown to enable host control action.
4) Initiate OS shutdown.
   (Driver will perform host control SMI when it is notified that the OS
   has finished shutting down.)
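
Steps 1) to 3) of the host control sequence could be scripted from C
roughly as follows. This is a sketch under assumptions, not code from
Dell OpenManage: write_attr() and arm_host_control() are hypothetical
helpers, the directory is passed in by the caller (on a real system it
would be /sys/devices/platform/dcdbas), and the action and SMI type
strings are passed through unchanged because their encoding is not
described here.

```c
#include <stdio.h>

/* Write a short string to dir/name; returns 0 on success, -1 on error. */
static int write_attr(const char *dir, const char *name, const char *value)
{
	char path[512];
	FILE *f;
	int ok;

	if (snprintf(path, sizeof(path), "%s/%s", dir, name) >=
	    (int)sizeof(path))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Arm a host control action (steps 1-3). The caller still has to start
 * the OS shutdown (step 4) itself.
 */
static int arm_host_control(const char *dir, const char *action,
			    const char *smi_type)
{
	if (write_attr(dir, "host_control_action", action))
		return -1;
	if (write_attr(dir, "host_control_smi_type", smi_type))
		return -1;
	return write_attr(dir, "host_control_on_shutdown", "1");
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN; args: dir action smi_type */
int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s dir action smi_type\n", argv[0]);
		return 2;
	}
	return arm_host_control(argv[1], argv[2], argv[3]) ? 1 : 0;
}
#endif
```

The writes are issued in the documented order and the sequence stops at
the first failure, so host_control_on_shutdown is never set to "1"
unless both earlier writes succeeded.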


Host Control SMI Type

The following table shows the value to write to host_control_smi_type to
perform a power cycle or power off host control action:

PowerEdge System    Host Control SMI Type
----------------    ---------------------
      300             HC_SMITYPE_TYPE1
     1300             HC_SMITYPE_TYPE1
     1400             HC_SMITYPE_TYPE2
      500SC           HC_SMITYPE_TYPE2
     1500SC           HC_SMITYPE_TYPE2
     1550             HC_SMITYPE_TYPE2
      600SC           HC_SMITYPE_TYPE2
     1600SC           HC_SMITYPE_TYPE2
      650             HC_SMITYPE_TYPE2
     1655MC           HC_SMITYPE_TYPE2
      700             HC_SMITYPE_TYPE3
      750             HC_SMITYPE_TYPE3

Debugging Modules after 2.6.3
-----------------------------

In almost all distributions, the kernel asks for modules which don't
exist, such as "net-pf-10" or whatever. Changing "modprobe -q" to
"succeed" in this case is hacky and breaks some setups, and also we
want to know if it failed for the fallback code for old aliases in
fs/char_dev.c, for example.

In the past a debugging message which would fill people's logs was
emitted. This debugging message has been removed. The correct way
of debugging module problems is something like this:

echo '#! /bin/sh' > /tmp/modprobe
echo 'echo "$@" >> /tmp/modprobe.log' >> /tmp/modprobe
echo 'exec /sbin/modprobe "$@"' >> /tmp/modprobe
chmod a+x /tmp/modprobe
echo /tmp/modprobe > /proc/sys/kernel/modprobe

Note that the above applies only when the *kernel* is requesting
that the module be loaded -- it won't have any effect if that module
is being loaded explicitly using "modprobe" from userspace.

Using physical DMA provided by OHCI-1394 FireWire controllers for debugging
---------------------------------------------------------------------------

Introduction
------------

Basically all FireWire controllers which are in use today are compliant
to the OHCI-1394 specification which defines the controller to be a PCI
bus master which uses DMA to offload data transfers from the CPU and has
a "Physical Response Unit" which executes specific requests by employing
PCI-Bus master DMA after applying filters defined by the OHCI-1394 driver.

Once properly configured, remote machines can send these requests to
ask the OHCI-1394 controller to perform read and write requests on
physical system memory and, for read requests, send the result of
the physical memory read back to the requester.

With that, it is possible to debug issues by reading interesting memory
locations such as buffers like the printk buffer or the process table.

Retrieving a full system memory dump is also possible over FireWire,
using data transfer rates in the order of 10MB/s or more.

Memory access is currently limited to the low 4G of physical address
space which can be a problem on IA64 machines where memory is located
mostly above that limit, but it is rarely a problem on more common
hardware such as hardware based on x86, x86-64 and PowerPC.

Together with an early initialization of the OHCI-1394 controller for
debugging, this facility proved most useful for examining long debug logs
in the printk buffer to debug early boot problems in areas like ACPI,
where the system fails to boot and other means of debugging (serial port)
are either not available (notebooks) or too slow for extensive debug
information (like ACPI).

Drivers
-------

The ohci1394 driver in drivers/ieee1394 initializes the OHCI-1394 controllers
to a working state and enables physical DMA by default for all remote nodes.
This can be turned off by ohci1394's module parameter phys_dma=0.

The alternative firewire-ohci driver in drivers/firewire uses filtered
physical DMA by default, which is more secure but not suitable for remote
debugging. Compile the driver with CONFIG_FIREWIRE_OHCI_REMOTE_DMA (Kernel
hacking menu: Remote debugging over FireWire with firewire-ohci) to get
unfiltered physical DMA.

Because ohci1394 and firewire-ohci depend on the PCI enumeration being
completed, an initialization routine which runs pretty early has been
implemented for x86. This routine runs long before console_init() can be
called, i.e.
before the printk buffer appears on the console.

To activate it, enable CONFIG_PROVIDE_OHCI1394_DMA_INIT (Kernel hacking menu:
Remote debugging over FireWire early on boot) and pass the parameter
"ohci1394_dma=early" to the recompiled kernel on boot.

Tools
-----

firescope - Originally developed by Benjamin Herrenschmidt; Andi Kleen ported
it from PowerPC to x86 and x86_64 and added functionality. firescope can now
be used to view the printk buffer of a remote machine, even with live update.

Bernhard Kaindl enhanced firescope to support accessing 64-bit machines
from 32-bit firescope and vice versa:
- http://halobates.de/firewire/firescope-0.2.2.tar.bz2

and he implemented fast system dump (alpha version - read README.txt):
- http://halobates.de/firewire/firedump-0.1.tar.bz2

There is also a gdb proxy for firewire which allows gdb to be used to access
data which can be referenced from symbols found by gdb in vmlinux:
- http://halobates.de/firewire/fireproxy-0.33.tar.bz2

The latest version of this gdb proxy (fireproxy-0.34) can communicate (not
yet stable) with kgdb over a memory-based communication module (kgdbom).

Getting Started
---------------

The OHCI-1394 specification requires that the OHCI-1394 controller must
disable all physical DMA on each bus reset.

This means that if you want to debug an issue in a system state where
interrupts are disabled and where no polling of the OHCI-1394 controller
for bus resets takes place, you have to establish any FireWire cable
connections and fully initialize all FireWire hardware __before__ the
system enters such a state.

Step-by-step instructions for using firescope with early OHCI initialization:

1) Verify that your hardware is supported:

   Load the ohci1394 or the fw-ohci module and check your kernel logs.
   You should see a line similar to

   ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[18]  MMIO=[fe9ff800-fe9fffff]
   ... Max Packet=[2048]  IR/IT contexts=[4/8]

   when loading the driver.
   If you have no supported controller, many PCI, CardBus and even
   some Express cards which are fully compliant to the OHCI-1394
   specification are available. A card that requires no driver for Windows
   operating systems is most likely compliant. Only specialized shops have
   cards which are not compliant; they are based on TI PCILynx chips and
   require drivers for Windows operating systems.

2) Establish a working FireWire cable connection:

   Any FireWire cable, as long as it provides an electrically and
   mechanically stable connection and has matching connectors (there are
   small 4-pin and large 6-pin FireWire ports), will do.

   If a driver is running on both machines you should see a line like

   ieee1394: Node added: ID:BUS[0-01:1023]  GUID[0090270001b84bba]

   on both machines in the kernel log when the cable is plugged in
   and connects the two machines.

3) Test physical DMA using firescope:

   On the debug host,
        - load the raw1394 module,
        - make sure that /dev/raw1394 is accessible,
   then start firescope:

        $ firescope
        Port 0 (ohci1394) opened, 2 nodes detected

        FireScope
        ---------
        Target :
        Gen    : 1
        [Ctrl-T] choose target
        [Ctrl-H] this menu
        [Ctrl-Q] quit

    ------> Press Ctrl-T now, the output should be similar to:

        2 nodes available, local node is: 0
         0: ffc0, uuid: 00000000 00000000 [LOCAL]
         1: ffc1, uuid: 00279000 ba4bb801

   Besides the [LOCAL] node, it must show another node without error message.

4) Prepare for debugging with early OHCI-1394 initialization:

   4.1) Kernel compilation and installation on debug target

   Compile the kernel to be debugged with CONFIG_PROVIDE_OHCI1394_DMA_INIT
   (Kernel hacking: Provide code for enabling DMA over FireWire early on boot)
   enabled and install it on the machine to be debugged (debug target).

   4.2) Transfer the System.map of the debugged kernel to the debug host

   Copy the System.map of the kernel to be debugged to the debug host (the
   host which is connected to the debugged machine over the FireWire cable).

5) Retrieving the printk buffer contents:

   With the FireWire cable connected and the OHCI-1394 driver on the debugging
   host loaded, reboot the debugged machine, booting the kernel which has
   CONFIG_PROVIDE_OHCI1394_DMA_INIT enabled, with the option ohci1394_dma=early.

   Then, on the debugging host, run firescope, for example by using -A:

        firescope -A System.map-of-debug-target-kernel

   Note: -A automatically attaches to the first non-local node. It only works
   reliably if only two machines are connected using FireWire.

   After having attached to the debug target, press Ctrl-D to view the
   complete printk buffer or Ctrl-U to enter auto update mode and get an
   updated live view of recent kernel messages logged on the debug target.

   Call "firescope -h" to get more information on firescope's options.

Notes
-----
Documentation and specifications: http://halobates.de/firewire/

FireWire is a trademark of Apple Inc. - for more information please refer to:
http://en.wikipedia.org/wiki/FireWire

Purpose:
Demonstrate the usage of the new open sourced rbu (Remote BIOS Update) driver
for updating BIOS images on Dell servers and desktops.

Scope:
This document discusses the functionality of the rbu driver only.
It does not cover the support needed from applications to enable the BIOS to
update itself with the image downloaded into memory.

Overview:
This driver works with Dell OpenManage or Dell Update Packages for updating
the BIOS on Dell servers (starting from servers sold since 1999), desktops
and notebooks (starting from those sold in 2005).
Please go to http://support.dell.com and register; there you can find info on
OpenManage and Dell Update Packages (DUP).
Libsmbios can also be used to update the BIOS on Dell systems; go to
http://linux.dell.com/libsmbios/ for details.

The dell_rbu driver supports BIOS update using the monolithic image and
packetized image methods. In the monolithic case the driver allocates a
contiguous chunk of physical pages holding the BIOS image.
In the packetized case the application using the driver breaks the image into
packets of fixed size and the driver places each packet in contiguous
physical memory. The driver also maintains a linked list of packets for
reading them back.
If the dell_rbu driver is unloaded all the allocated memory is freed.

The rbu driver needs to have an application (as mentioned above) which will
inform the BIOS to enable the update on the next system reboot.

The user should not unload the rbu driver after downloading the BIOS image
or updating.

The driver load creates the following entries under the /sys file system.
/sys/class/firmware/dell_rbu/loading
/sys/class/firmware/dell_rbu/data
/sys/devices/platform/dell_rbu/image_type
/sys/devices/platform/dell_rbu/data
/sys/devices/platform/dell_rbu/packet_size

The driver supports two types of update mechanism: monolithic and packetized.
Which mechanism applies depends on the BIOS currently running on the system.
Most Dell systems support a monolithic update where the BIOS image is
copied to a single contiguous block of physical memory.
With the packet mechanism the single block can be broken into smaller chunks
of contiguous memory and the BIOS image is scattered across these packets.

By default the driver uses monolithic memory for the update type. This can be
changed to packets at driver load time by specifying the load
parameter image_type=packet. This can also be changed later as below

echo packet > /sys/devices/platform/dell_rbu/image_type

In packet update mode the packet size has to be given before any packets can
be downloaded. It is done as below

echo XXXX > /sys/devices/platform/dell_rbu/packet_size

In the packet update mechanism, the user needs to create a new file having
packets of data arranged back to back.
It can be done as follows:
The user creates a packet header, gets a chunk of the BIOS image and
places it next to the packet header; the packet header plus the BIOS image
chunk together should match the specified packet_size. This makes one
packet. The user needs to create more such packets out of the entire BIOS
image file and then arrange all these packets back to back into one single
file.
This file is then copied to /sys/class/firmware/dell_rbu/data.
Once this file gets to the driver, the driver extracts packet_size data from
the file and spreads it across the physical memory in contiguous packet_size
spaces.
This method makes sure that all the packets get to the driver in a single
operation.

In monolithic update the user simply gets the BIOS image (.hdr file) and
copies it to the data file as is, without any change to the BIOS image itself.

Do the steps below to download the BIOS image.
1) echo 1 > /sys/class/firmware/dell_rbu/loading
2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data
3) echo 0 > /sys/class/firmware/dell_rbu/loading

The /sys/class/firmware/dell_rbu/ entries will remain until the following is
done.
echo -1 > /sys/class/firmware/dell_rbu/loading
Until this step is completed the driver cannot be unloaded.
Also, echoing either mono, packet or init into image_type will free up the
memory allocated by the driver.

If a user by accident executes steps 1 and 3 above without executing step 2,
it will make the /sys/class/firmware/dell_rbu/ entries disappear.
The entries can be recreated by doing the following
echo init > /sys/devices/platform/dell_rbu/image_type
NOTE: echoing init into image_type does not change its original value.

The driver also provides the read-only file
/sys/devices/platform/dell_rbu/data to read back the downloaded image.

NOTE:
This driver requires a patch for firmware_class.c which has the modified
request_firmware_nowait function.
Also, after updating the BIOS image a user mode application needs to execute
code which sends the BIOS update request to the BIOS, so that on the next
reboot the BIOS knows about the newly downloaded image and updates itself.
Also, don't unload the rbu driver if the image has to be updated.


                    LINUX ALLOCATED DEVICES (2.6+ version)

                     Maintained by Alan Cox

                      Last revised: 6th April 2009

This list is the Linux Device List, the official registry of allocated
device numbers and /dev directory nodes for the Linux operating
system.

The latest version of this list is available from
http://www.lanana.org/docs/device-list/ or
ftp://ftp.kernel.org/pub/linux/docs/device-list/. This version may be
newer than the one distributed with the Linux kernel.

The LaTeX version of this document is no longer maintained.

This document is included by reference into the Filesystem Hierarchy
Standard (FHS). The FHS is available from
http://www.pathname.com/fhs/.

Allocations marked (68k/Amiga) apply to Linux/68k on the Amiga
platform only. Allocations marked (68k/Atari) apply to Linux/68k on
the Atari platform only.

The symbol {2.6} means the allocation is obsolete and scheduled for
removal once kernel version 2.6 (or equivalent) is released. Some of these
allocations have already been removed.

This document is in the public domain. The author requests, however,
that semantically altered versions are not distributed without
permission of the author, assuming the author can be contacted without
an unreasonable effort.

In particular, please don't send patches for this list to Linus, at
least not without contacting me first.

I do not have any information about these devices beyond what appears
on this list. Any such information requests will be deleted without
reply.


          **** DEVICE DRIVERS AUTHORS PLEASE READ THIS ****

To have a major number allocated, or a minor number in situations
where that applies (e.g. busmice), please contact me with the
appropriate device information.
Also, if you have additional information regarding any of the devices
listed below, or if I have made a mistake, I would greatly appreciate a
note.

I do, however, make a few requests about the nature of your report.
This is necessary for me to be able to keep this list up to date and
correct in a timely manner. First of all, *please* send it to the
correct address... . I receive hundreds of email messages a day, so
mail sent to other addresses may very well get lost in the avalanche.
Please put in a descriptive subject, so I can find your mail again
should I need to. Too many people send me email saying just "device
number request" in the subject.

Second, please include a description of the device *in the same format
as this list*. The reason for this is that it is the only way I have
found to ensure I have all the requisite information to publish your
device and avoid conflicts.

Third, please don't assume that the distributed version of the list is
up to date. Due to the number of registrations I have to maintain it
in "batch mode", so there are likely additional registrations that
haven't been listed yet.

Fourth, remember that Linux now has extensive support for dynamic
allocation of device numbering and can use sysfs and udev to handle the
naming needs. There are still some exceptions in the serial and boot
device area. Before asking for a device number make sure you actually
need one.

Finally, sometimes I have to play "namespace police." Please don't be
offended. I often get submissions for /dev names that would be bound
to cause conflicts down the road. I am trying to avoid getting into a
situation where we would have to suffer an incompatible forward
change. Therefore, please consult with me *before* you make your
device names and numbers in any way public, at least to the point
where it would be at all difficult to get them changed.

Your cooperation is appreciated.


  0             Unnamed devices (e.g.
non-device mounts) 0 = reserved as null device number See block major 144, 145, 146 for expansion areas. 1 char Memory devices 1 = /dev/mem Physical memory access 2 = /dev/kmem Kernel virtual memory access 3 = /dev/null Null device 4 = /dev/port I/O port access 5 = /dev/zero Null byte source 6 = /dev/core OBSOLETE - replaced by /proc/kcore 7 = /dev/full Returns ENOSPC on write 8 = /dev/random Nondeterministic random number gen. 9 = /dev/urandom Faster, less secure random number gen. 10 = /dev/aio Asynchronous I/O notification interface 11 = /dev/kmsg Writes to this come out as printk's 12 = /dev/oldmem Used by crashdump kernels to access the memory of the kernel that crashed. 1 block RAM disk 0 = /dev/ram0 First RAM disk 1 = /dev/ram1 Second RAM disk ... 250 = /dev/initrd Initial RAM disk Older kernels had /dev/ramdisk (1, 1) here. /dev/initrd refers to a RAM disk which was preloaded by the boot loader; newer kernels use /dev/ram0 for the initrd. 2 char Pseudo-TTY masters 0 = /dev/ptyp0 First PTY master 1 = /dev/ptyp1 Second PTY master ... 255 = /dev/ptyef 256th PTY master Pseudo-tty's are named as follows: * Masters are "pty", slaves are "tty"; * the fourth letter is one of pqrstuvwxyzabcde indicating the 1st through 16th series of 16 pseudo-ttys each, and * the fifth letter is one of 0123456789abcdef indicating the position within the series. These are the old-style (BSD) PTY devices; Unix98 devices are on major 128 and above and use the PTY master multiplex (/dev/ptmx) to acquire a PTY on demand. 2 block Floppy disks 0 = /dev/fd0 Controller 0, drive 0, autodetect 1 = /dev/fd1 Controller 0, drive 1, autodetect 2 = /dev/fd2 Controller 0, drive 2, autodetect 3 = /dev/fd3 Controller 0, drive 3, autodetect 128 = /dev/fd4 Controller 1, drive 0, autodetect 129 = /dev/fd5 Controller 1, drive 1, autodetect 130 = /dev/fd6 Controller 1, drive 2, autodetect 131 = /dev/fd7 Controller 1, drive 3, autodetect To specify format, add to the autodetect device number: 0 = /dev/fd? 
Autodetect format 4 = /dev/fd?d360 5.25" 360K in a 360K drive(1) 20 = /dev/fd?h360 5.25" 360K in a 1200K drive(1) 48 = /dev/fd?h410 5.25" 410K in a 1200K drive 64 = /dev/fd?h420 5.25" 420K in a 1200K drive 24 = /dev/fd?h720 5.25" 720K in a 1200K drive 80 = /dev/fd?h880 5.25" 880K in a 1200K drive(1) 8 = /dev/fd?h1200 5.25" 1200K in a 1200K drive(1) 40 = /dev/fd?h1440 5.25" 1440K in a 1200K drive(1) 56 = /dev/fd?h1476 5.25" 1476K in a 1200K drive 72 = /dev/fd?h1494 5.25" 1494K in a 1200K drive 92 = /dev/fd?h1600 5.25" 1600K in a 1200K drive(1) 12 = /dev/fd?u360 3.5" 360K Double Density(2) 16 = /dev/fd?u720 3.5" 720K Double Density(1) 120 = /dev/fd?u800 3.5" 800K Double Density(2) 52 = /dev/fd?u820 3.5" 820K Double Density 68 = /dev/fd?u830 3.5" 830K Double Density 84 = /dev/fd?u1040 3.5" 1040K Double Density(1) 88 = /dev/fd?u1120 3.5" 1120K Double Density(1) 28 = /dev/fd?u1440 3.5" 1440K High Density(1) 124 = /dev/fd?u1600 3.5" 1600K High Density(1) 44 = /dev/fd?u1680 3.5" 1680K High Density(3) 60 = /dev/fd?u1722 3.5" 1722K High Density 76 = /dev/fd?u1743 3.5" 1743K High Density 96 = /dev/fd?u1760 3.5" 1760K High Density 116 = /dev/fd?u1840 3.5" 1840K High Density(3) 100 = /dev/fd?u1920 3.5" 1920K High Density(1) 32 = /dev/fd?u2880 3.5" 2880K Extra Density(1) 104 = /dev/fd?u3200 3.5" 3200K Extra Density 108 = /dev/fd?u3520 3.5" 3520K Extra Density 112 = /dev/fd?u3840 3.5" 3840K Extra Density(1) 36 = /dev/fd?CompaQ Compaq 2880K drive; obsolete? (1) Autodetectable format (2) Autodetectable format in a Double Density (720K) drive only (3) Autodetectable format in a High Density (1440K) drive only NOTE: The letter in the device name (d, q, h or u) signifies the type of drive: 5.25" Double Density (d), 5.25" Quad Density (q), 5.25" High Density (h) or 3.5" (any model, u). The use of the capital letters D, H and E for the 3.5" models has been deprecated, since the drive type is insignificant for these devices.
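The rule for floppy minors (block major 2) is "autodetect device number plus format offset". A minimal Python sketch of that arithmetic follows. The names floppy_minor and FORMAT_OFFSET are illustrative assumptions, not a kernel interface, and only a handful of offsets from the table are copied in.

```python
# Illustrative sketch only: these names are hypothetical, not a kernel API.
FLOPPY_MAJOR = 2  # block major for floppy disks, per the table above

# A few format offsets copied from the table above (not the full list).
FORMAT_OFFSET = {
    "": 0,        # autodetect format
    "d360": 4,
    "h1200": 8,
    "u720": 16,
    "u1440": 28,
    "u2880": 32,
}


def floppy_minor(drive, fmt=""):
    """Return the minor number for /dev/fd<drive><fmt>.

    Drives 0-3 sit on controller 0 (base minors 0-3) and drives 4-7
    on controller 1 (base minors 128-131), as listed in the table.
    """
    if not 0 <= drive <= 7:
        raise ValueError("drive must be in 0..7")
    controller, unit = divmod(drive, 4)
    base = 128 * controller + unit
    return base + FORMAT_OFFSET[fmt]


if __name__ == "__main__":
    for drive, fmt in [(0, ""), (0, "u1440"), (1, "u1440"), (4, "h1200")]:
        minor = floppy_minor(drive, fmt)
        print(f"/dev/fd{drive}{fmt}: ({FLOPPY_MAJOR}, {minor})")
```

Under these assumptions, /dev/fd0u1440 comes out as (2, 28) and /dev/fd4h1200 as (2, 136).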
3 char Pseudo-TTY slaves 0 = /dev/ttyp0 First PTY slave 1 = /dev/ttyp1 Second PTY slave ... 255 = /dev/ttyef 256th PTY slave These are the old-style (BSD) PTY devices; Unix98 devices are on major 136 and above. 3 block First MFM, RLL and IDE hard disk/CD-ROM interface 0 = /dev/hda Master: whole disk (or CD-ROM) 64 = /dev/hdb Slave: whole disk (or CD-ROM) For partitions, add to the whole disk device number: 0 = /dev/hd? Whole disk 1 = /dev/hd?1 First partition 2 = /dev/hd?2 Second partition ... 63 = /dev/hd?63 63rd partition For Linux/i386, partitions 1-4 are the primary partitions, and 5 and above are logical partitions. Other versions of Linux use partitioning schemes appropriate to their respective architectures. 4 char TTY devices 0 = /dev/tty0 Current virtual console 1 = /dev/tty1 First virtual console ... 63 = /dev/tty63 63rd virtual console 64 = /dev/ttyS0 First UART serial port ... 255 = /dev/ttyS191 192nd UART serial port UART serial ports refer to 8250/16450/16550 series devices. Older versions of the Linux kernel used this major number for BSD PTY devices. As of Linux 2.1.115, this is no longer supported. Use major numbers 2 and 3. 4 block Aliases for dynamically allocated major devices to be used when it's not possible to create the real device nodes because the root filesystem is mounted read-only. 0 = /dev/root 5 char Alternate TTY devices 0 = /dev/tty Current TTY device 1 = /dev/console System console 2 = /dev/ptmx PTY master multiplex 3 = /dev/ttyprintk User messages via printk TTY device 64 = /dev/cua0 Callout device for ttyS0 ... 255 = /dev/cua191 Callout device for ttyS191 (5,1) is /dev/console starting with Linux 2.1.71. See the section on terminal devices for more information on /dev/console. 6 char Parallel printer devices 0 = /dev/lp0 Parallel printer on parport0 1 = /dev/lp1 Parallel printer on parport1 ... Current Linux kernels no longer have a fixed mapping between parallel ports and I/O addresses.
Instead, they are redirected through the parport multiplex layer. 7 char Virtual console capture devices 0 = /dev/vcs Current vc text contents 1 = /dev/vcs1 tty1 text contents ... 63 = /dev/vcs63 tty63 text contents 128 = /dev/vcsa Current vc text/attribute contents 129 = /dev/vcsa1 tty1 text/attribute contents ... 191 = /dev/vcsa63 tty63 text/attribute contents NOTE: These devices permit both read and write access. 7 block Loopback devices 0 = /dev/loop0 First loop device 1 = /dev/loop1 Second loop device ... The loop devices are used to mount filesystems not associated with block devices. The binding to the loop devices is handled by mount(8) or losetup(8). 8 block SCSI disk devices (0-15) 0 = /dev/sda First SCSI disk whole disk 16 = /dev/sdb Second SCSI disk whole disk 32 = /dev/sdc Third SCSI disk whole disk ... 240 = /dev/sdp Sixteenth SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 9 char SCSI tape devices 0 = /dev/st0 First SCSI tape, mode 0 1 = /dev/st1 Second SCSI tape, mode 0 ... 32 = /dev/st0l First SCSI tape, mode 1 33 = /dev/st1l Second SCSI tape, mode 1 ... 64 = /dev/st0m First SCSI tape, mode 2 65 = /dev/st1m Second SCSI tape, mode 2 ... 96 = /dev/st0a First SCSI tape, mode 3 97 = /dev/st1a Second SCSI tape, mode 3 ... 128 = /dev/nst0 First SCSI tape, mode 0, no rewind 129 = /dev/nst1 Second SCSI tape, mode 0, no rewind ... 160 = /dev/nst0l First SCSI tape, mode 1, no rewind 161 = /dev/nst1l Second SCSI tape, mode 1, no rewind ... 192 = /dev/nst0m First SCSI tape, mode 2, no rewind 193 = /dev/nst1m Second SCSI tape, mode 2, no rewind ... 224 = /dev/nst0a First SCSI tape, mode 3, no rewind 225 = /dev/nst1a Second SCSI tape, mode 3, no rewind ... "No rewind" refers to the omission of the default automatic rewind on device close. The MTREW or MTOFFL ioctl()'s can be used to rewind the tape regardless of the device used to access it. 
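The SCSI tape minors (char major 9) listed above appear to pack three fields: the drive number in the low 32 values, the mode in steps of 32, and the "no rewind" flag at 128. A small Python sketch that decodes a minor back into a device name under that reading follows. The helper name st_device_name is hypothetical, and the layout is inferred from the table rather than taken from driver source.

```python
# Illustrative sketch only: the helper name is hypothetical.
ST_MAJOR = 9  # char major for SCSI tape devices, per the table above
MODE_SUFFIX = ["", "l", "m", "a"]  # name suffixes for modes 0-3


def st_device_name(minor):
    """Decode a SCSI tape minor number into its device name."""
    if not 0 <= minor <= 255:
        raise ValueError("minor must be in 0..255")
    no_rewind = minor >= 128      # minors 128-255 are the "no rewind" nodes
    mode = (minor % 128) // 32    # each mode spans 32 minors
    unit = minor % 32             # tape drive number within the mode
    prefix = "nst" if no_rewind else "st"
    return f"/dev/{prefix}{unit}{MODE_SUFFIX[mode]}"


if __name__ == "__main__":
    for minor in (0, 33, 64, 97, 128, 225):
        print(f"({ST_MAJOR}, {minor}) -> {st_device_name(minor)}")
```

Every decoded name in the loop can be checked directly against an entry in the list above, for example minor 33 against /dev/st1l.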
9 block Metadisk (RAID) devices 0 = /dev/md0 First metadisk group 1 = /dev/md1 Second metadisk group ... The metadisk driver is used to span a filesystem across multiple physical disks. 10 char Non-serial mice, misc features 0 = /dev/logibm Logitech bus mouse 1 = /dev/psaux PS/2-style mouse port 2 = /dev/inportbm Microsoft Inport bus mouse 3 = /dev/atibm ATI XL bus mouse 4 = /dev/jbm J-mouse 4 = /dev/amigamouse Amiga mouse (68k/Amiga) 5 = /dev/atarimouse Atari mouse 6 = /dev/sunmouse Sun mouse 7 = /dev/amigamouse1 Second Amiga mouse 8 = /dev/smouse Simple serial mouse driver 9 = /dev/pc110pad IBM PC-110 digitizer pad 10 = /dev/adbmouse Apple Desktop Bus mouse 11 = /dev/vrtpanel Vr41xx embedded touch panel 13 = /dev/vpcmouse Connectix Virtual PC Mouse 14 = /dev/touchscreen/ucb1x00 UCB 1x00 touchscreen 15 = /dev/touchscreen/mk712 MK712 touchscreen 128 = /dev/beep Fancy beep device 129 = 130 = /dev/watchdog Watchdog timer port 131 = /dev/temperature Machine internal temperature 132 = /dev/hwtrap Hardware fault trap 133 = /dev/exttrp External device trap 134 = /dev/apm_bios Advanced Power Management BIOS 135 = /dev/rtc Real Time Clock 139 = /dev/openprom SPARC OpenBoot PROM 140 = /dev/relay8 Berkshire Products Octal relay card 141 = /dev/relay16 Berkshire Products ISO-16 relay card 142 = 143 = /dev/pciconf PCI configuration space 144 = /dev/nvram Non-volatile configuration RAM 145 = /dev/hfmodem Soundcard shortwave modem control 146 = /dev/graphics Linux/SGI graphics device 147 = /dev/opengl Linux/SGI OpenGL pipe 148 = /dev/gfx Linux/SGI graphics effects device 149 = /dev/input/mouse Linux/SGI Irix emulation mouse 150 = /dev/input/keyboard Linux/SGI Irix emulation keyboard 151 = /dev/led Front panel LEDs 152 = /dev/kpoll Kernel Poll Driver 153 = /dev/mergemem Memory merge device 154 = /dev/pmu Macintosh PowerBook power manager 155 = /dev/isictl MultiTech ISICom serial control 156 = /dev/lcd Front panel LCD display 157 = /dev/ac Applicom Intl Profibus card 158 = 
/dev/nwbutton Netwinder external button 159 = /dev/nwdebug Netwinder debug interface 160 = /dev/nwflash Netwinder flash memory 161 = /dev/userdma User-space DMA access 162 = /dev/smbus System Management Bus 163 = /dev/lik Logitech Internet Keyboard 164 = /dev/ipmo Intel Intelligent Platform Management 165 = /dev/vmmon VMware virtual machine monitor 166 = /dev/i2o/ctl I2O configuration manager 167 = /dev/specialix_sxctl Specialix serial control 168 = /dev/tcldrv Technology Concepts serial control 169 = /dev/specialix_rioctl Specialix RIO serial control 170 = /dev/thinkpad/thinkpad IBM Thinkpad devices 171 = /dev/srripc QNX4 API IPC manager 172 = /dev/usemaclone Semaphore clone device 173 = /dev/ipmikcs Intelligent Platform Management 174 = /dev/uctrl SPARCbook 3 microcontroller 175 = /dev/agpgart AGP Graphics Address Remapping Table 176 = /dev/gtrsc Gorgy Timing radio clock 177 = /dev/cbm Serial CBM bus 178 = /dev/jsflash JavaStation OS flash SIMM 179 = /dev/xsvc High-speed shared-mem/semaphore service 180 = /dev/vrbuttons Vr41xx button input device 181 = /dev/toshiba Toshiba laptop SMM support 182 = /dev/perfctr Performance-monitoring counters 183 = /dev/hwrng Generic random number generator 184 = /dev/cpu/microcode CPU microcode update interface 186 = /dev/atomicps Atomic snapshot of process state data 187 = /dev/irnet IrNET device 188 = /dev/smbusbios SMBus BIOS 189 = /dev/ussp_ctl User space serial port control 190 = /dev/crash Mission Critical Linux crash dump facility 191 = /dev/pcl181 192 = /dev/nas_xbus NAS xbus LCD/buttons access 193 = /dev/d7s SPARC 7-segment display 194 = /dev/zkshim Zero-Knowledge network shim control 195 = /dev/elographics/e2201 Elographics touchscreen E271-2201 198 = /dev/sexec Signed executable interface 199 = /dev/scanners/cuecat :CueCat barcode scanner 200 = /dev/net/tun TAP/TUN network device 201 = /dev/button/gulpb Transmeta GULP-B buttons 202 = /dev/emd/ctl Enhanced Metadisk RAID (EMD) control 204 = /dev/video/em8300 EM8300 DVD
decoder control 205 = /dev/video/em8300_mv EM8300 DVD decoder video 206 = /dev/video/em8300_ma EM8300 DVD decoder audio 207 = /dev/video/em8300_sp EM8300 DVD decoder subpicture 208 = /dev/compaq/cpqphpc Compaq PCI Hot Plug Controller 209 = /dev/compaq/cpqrid Compaq Remote Insight Driver 210 = /dev/impi/bt IMPI coprocessor block transfer 211 = /dev/impi/smic IMPI coprocessor stream interface 212 = /dev/watchdogs/0 First watchdog device 213 = /dev/watchdogs/1 Second watchdog device 214 = /dev/watchdogs/2 Third watchdog device 215 = /dev/watchdogs/3 Fourth watchdog device 216 = /dev/fujitsu/apanel Fujitsu/Siemens application panel 217 = /dev/ni/natmotn National Instruments Motion 218 = /dev/kchuid Inter-process chuid control 219 = /dev/modems/mwave MWave modem firmware upload 220 = /dev/mptctl Message passing technology (MPT) control 221 = /dev/mvista/hssdsi Montavista PICMG hot swap system driver 222 = /dev/mvista/hasi Montavista PICMG high availability 223 = /dev/input/uinput User level driver support for input 224 = /dev/tpm TCPA TPM driver 225 = /dev/pps Pulse Per Second driver 226 = /dev/systrace Systrace device 227 = /dev/mcelog X86_64 Machine Check Exception driver 228 = /dev/hpet HPET driver 229 = /dev/fuse Fuse (virtual filesystem in user-space) 230 = /dev/midishare MidiShare driver 231 = /dev/snapshot System memory snapshot device 232 = /dev/kvm Kernel-based virtual machine (hardware virtualization extensions) 233 = /dev/kmview View-OS A process with a view 234 = /dev/btrfs-control Btrfs control device 235 = /dev/autofs Autofs control device 236 = /dev/mapper/control Device-Mapper control device 237 = /dev/loop-control Loopback control device 238 = /dev/vhost-net Host kernel accelerator for virtio net 240-254 Reserved for local use 255 Reserved for MISC_DYNAMIC_MINOR 11 char Raw keyboard device (Linux/SPARC only) 0 = /dev/kbd Raw keyboard device 11 char Serial Mux device (Linux/PA-RISC only) 0 = /dev/ttyB0 First mux port 1 = /dev/ttyB1 Second mux port ... 
11 block SCSI CD-ROM devices 0 = /dev/scd0 First SCSI CD-ROM 1 = /dev/scd1 Second SCSI CD-ROM ... The prefix /dev/sr (instead of /dev/scd) has been deprecated. 12 char QIC-02 tape 2 = /dev/ntpqic11 QIC-11, no rewind-on-close 3 = /dev/tpqic11 QIC-11, rewind-on-close 4 = /dev/ntpqic24 QIC-24, no rewind-on-close 5 = /dev/tpqic24 QIC-24, rewind-on-close 6 = /dev/ntpqic120 QIC-120, no rewind-on-close 7 = /dev/tpqic120 QIC-120, rewind-on-close 8 = /dev/ntpqic150 QIC-150, no rewind-on-close 9 = /dev/tpqic150 QIC-150, rewind-on-close The device names specified are proposed -- if there are "standard" names for these devices, please let me know. 12 block 13 char Input core 0 = /dev/input/js0 First joystick 1 = /dev/input/js1 Second joystick ... 32 = /dev/input/mouse0 First mouse 33 = /dev/input/mouse1 Second mouse ... 63 = /dev/input/mice Unified mouse 64 = /dev/input/event0 First event queue 65 = /dev/input/event1 Second event queue ... Each device type has 5 bits (32 minors). 13 block 8-bit MFM/RLL/IDE controller 0 = /dev/xda First XT disk whole disk 64 = /dev/xdb Second XT disk whole disk Partitions are handled in the same way as IDE disks (see major number 3). 14 char Open Sound System (OSS) 0 = /dev/mixer Mixer control 1 = /dev/sequencer Audio sequencer 2 = /dev/midi00 First MIDI port 3 = /dev/dsp Digital audio 4 = /dev/audio Sun-compatible digital audio 6 = 7 = /dev/audioctl SPARC audio control device 8 = /dev/sequencer2 Sequencer -- alternate device 16 = /dev/mixer1 Second soundcard mixer control 17 = /dev/patmgr0 Sequencer patch manager 18 = /dev/midi01 Second MIDI port 19 = /dev/dsp1 Second soundcard digital audio 20 = /dev/audio1 Second soundcard Sun digital audio 33 = /dev/patmgr1 Sequencer patch manager 34 = /dev/midi02 Third MIDI port 50 = /dev/midi03 Fourth MIDI port 14 block 15 char Joystick 0 = /dev/js0 First analog joystick 1 = /dev/js1 Second analog joystick ... 128 = /dev/djs0 First digital joystick 129 = /dev/djs1 Second digital joystick ... 
15 block Sony CDU-31A/CDU-33A CD-ROM 0 = /dev/sonycd Sony CDU-31a CD-ROM 16 char Non-SCSI scanners 0 = /dev/gs4500 Genius 4500 handheld scanner 16 block GoldStar CD-ROM 0 = /dev/gscd GoldStar CD-ROM 17 char OBSOLETE (was Chase serial card) 0 = /dev/ttyH0 First Chase port 1 = /dev/ttyH1 Second Chase port ... 17 block Optics Storage CD-ROM 0 = /dev/optcd Optics Storage CD-ROM 18 char OBSOLETE (was Chase serial card - alternate devices) 0 = /dev/cuh0 Callout device for ttyH0 1 = /dev/cuh1 Callout device for ttyH1 ... 18 block Sanyo CD-ROM 0 = /dev/sjcd Sanyo CD-ROM 19 char Cyclades serial card 0 = /dev/ttyC0 First Cyclades port ... 31 = /dev/ttyC31 32nd Cyclades port 19 block "Double" compressed disk 0 = /dev/double0 First compressed disk ... 7 = /dev/double7 Eighth compressed disk 128 = /dev/cdouble0 Mirror of first compressed disk ... 135 = /dev/cdouble7 Mirror of eighth compressed disk See the Double documentation for the meaning of the mirror devices. 20 char Cyclades serial card - alternate devices 0 = /dev/cub0 Callout device for ttyC0 ... 31 = /dev/cub31 Callout device for ttyC31 20 block Hitachi CD-ROM (under development) 0 = /dev/hitcd Hitachi CD-ROM 21 char Generic SCSI access 0 = /dev/sg0 First generic SCSI device 1 = /dev/sg1 Second generic SCSI device ... Most distributions name these /dev/sga, /dev/sgb...; this sets an unnecessary limit of 26 SCSI devices in the system and is counter to standard Linux device-naming practice. 21 block Acorn MFM hard drive interface 0 = /dev/mfma First MFM drive whole disk 64 = /dev/mfmb Second MFM drive whole disk This device is used on the ARM-based Acorn RiscPC. Partitions are handled the same way as for IDE disks (see major number 3). 22 char Digiboard serial card 0 = /dev/ttyD0 First Digiboard port 1 = /dev/ttyD1 Second Digiboard port ... 
22 block Second IDE hard disk/CD-ROM interface 0 = /dev/hdc Master: whole disk (or CD-ROM) 64 = /dev/hdd Slave: whole disk (or CD-ROM) Partitions are handled the same way as for the first interface (see major number 3). 23 char Digiboard serial card - alternate devices 0 = /dev/cud0 Callout device for ttyD0 1 = /dev/cud1 Callout device for ttyD1 ... 23 block Mitsumi proprietary CD-ROM 0 = /dev/mcd Mitsumi CD-ROM 24 char Stallion serial card 0 = /dev/ttyE0 Stallion port 0 card 0 1 = /dev/ttyE1 Stallion port 1 card 0 ... 64 = /dev/ttyE64 Stallion port 0 card 1 65 = /dev/ttyE65 Stallion port 1 card 1 ... 128 = /dev/ttyE128 Stallion port 0 card 2 129 = /dev/ttyE129 Stallion port 1 card 2 ... 192 = /dev/ttyE192 Stallion port 0 card 3 193 = /dev/ttyE193 Stallion port 1 card 3 ... 24 block Sony CDU-535 CD-ROM 0 = /dev/cdu535 Sony CDU-535 CD-ROM 25 char Stallion serial card - alternate devices 0 = /dev/cue0 Callout device for ttyE0 1 = /dev/cue1 Callout device for ttyE1 ... 64 = /dev/cue64 Callout device for ttyE64 65 = /dev/cue65 Callout device for ttyE65 ... 128 = /dev/cue128 Callout device for ttyE128 129 = /dev/cue129 Callout device for ttyE129 ... 192 = /dev/cue192 Callout device for ttyE192 193 = /dev/cue193 Callout device for ttyE193 ... 
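The Stallion entries (char majors 24 and 25) step by 64 minors per card, so a port's minor looks like 64 * card + port. The Python sketch below encodes that reading. The function name stallion_nodes is hypothetical, and the limits of four cards and 64 ports per card are inferred from the table, not stated in it.

```python
# Illustrative sketch only: names and limits are assumptions from the table.
STALLION_TTY_MAJOR = 24      # /dev/ttyE* nodes
STALLION_CALLOUT_MAJOR = 25  # /dev/cue* alternate (callout) nodes
PORTS_PER_CARD = 64          # inferred: cards start at minors 0, 64, 128, 192


def stallion_nodes(card, port):
    """Return ((major, minor, name), (major, minor, name)) for a port.

    The first tuple is the normal tty node, the second the callout node.
    """
    if not 0 <= card <= 3:
        raise ValueError("card must be in 0..3")
    if not 0 <= port < PORTS_PER_CARD:
        raise ValueError("port must be in 0..63")
    minor = PORTS_PER_CARD * card + port
    tty = (STALLION_TTY_MAJOR, minor, f"/dev/ttyE{minor}")
    callout = (STALLION_CALLOUT_MAJOR, minor, f"/dev/cue{minor}")
    return tty, callout


if __name__ == "__main__":
    print(stallion_nodes(1, 1))
    print(stallion_nodes(3, 0))
```

The same minor is used for the tty node and its callout node. Only the major differs.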
25 block First Matsushita (Panasonic/SoundBlaster) CD-ROM 0 = /dev/sbpcd0 Panasonic CD-ROM controller 0 unit 0 1 = /dev/sbpcd1 Panasonic CD-ROM controller 0 unit 1 2 = /dev/sbpcd2 Panasonic CD-ROM controller 0 unit 2 3 = /dev/sbpcd3 Panasonic CD-ROM controller 0 unit 3 26 char 26 block Second Matsushita (Panasonic/SoundBlaster) CD-ROM 0 = /dev/sbpcd4 Panasonic CD-ROM controller 1 unit 0 1 = /dev/sbpcd5 Panasonic CD-ROM controller 1 unit 1 2 = /dev/sbpcd6 Panasonic CD-ROM controller 1 unit 2 3 = /dev/sbpcd7 Panasonic CD-ROM controller 1 unit 3 27 char QIC-117 tape 0 = /dev/qft0 Unit 0, rewind-on-close 1 = /dev/qft1 Unit 1, rewind-on-close 2 = /dev/qft2 Unit 2, rewind-on-close 3 = /dev/qft3 Unit 3, rewind-on-close 4 = /dev/nqft0 Unit 0, no rewind-on-close 5 = /dev/nqft1 Unit 1, no rewind-on-close 6 = /dev/nqft2 Unit 2, no rewind-on-close 7 = /dev/nqft3 Unit 3, no rewind-on-close 16 = /dev/zqft0 Unit 0, rewind-on-close, compression 17 = /dev/zqft1 Unit 1, rewind-on-close, compression 18 = /dev/zqft2 Unit 2, rewind-on-close, compression 19 = /dev/zqft3 Unit 3, rewind-on-close, compression 20 = /dev/nzqft0 Unit 0, no rewind-on-close, compression 21 = /dev/nzqft1 Unit 1, no rewind-on-close, compression 22 = /dev/nzqft2 Unit 2, no rewind-on-close, compression 23 = /dev/nzqft3 Unit 3, no rewind-on-close, compression 32 = /dev/rawqft0 Unit 0, rewind-on-close, no file marks 33 = /dev/rawqft1 Unit 1, rewind-on-close, no file marks 34 = /dev/rawqft2 Unit 2, rewind-on-close, no file marks 35 = /dev/rawqft3 Unit 3, rewind-on-close, no file marks 36 = /dev/nrawqft0 Unit 0, no rewind-on-close, no file marks 37 = /dev/nrawqft1 Unit 1, no rewind-on-close, no file marks 38 = /dev/nrawqft2 Unit 2, no rewind-on-close, no file marks 39 = /dev/nrawqft3 Unit 3, no rewind-on-close, no file marks 27 block Third Matsushita (Panasonic/SoundBlaster) CD-ROM 0 = /dev/sbpcd8 Panasonic CD-ROM controller 2 unit 0 1 = /dev/sbpcd9 Panasonic CD-ROM controller 2 unit 1 2 = /dev/sbpcd10 Panasonic CD-ROM 
controller 2 unit 2 3 = /dev/sbpcd11 Panasonic CD-ROM controller 2 unit 3 28 char Stallion serial card - card programming 0 = /dev/staliomem0 First Stallion card I/O memory 1 = /dev/staliomem1 Second Stallion card I/O memory 2 = /dev/staliomem2 Third Stallion card I/O memory 3 = /dev/staliomem3 Fourth Stallion card I/O memory 28 char Atari SLM ACSI laser printer (68k/Atari) 0 = /dev/slm0 First SLM laser printer 1 = /dev/slm1 Second SLM laser printer ... 28 block Fourth Matsushita (Panasonic/SoundBlaster) CD-ROM 0 = /dev/sbpcd12 Panasonic CD-ROM controller 3 unit 0 1 = /dev/sbpcd13 Panasonic CD-ROM controller 3 unit 1 2 = /dev/sbpcd14 Panasonic CD-ROM controller 3 unit 2 3 = /dev/sbpcd15 Panasonic CD-ROM controller 3 unit 3 28 block ACSI disk (68k/Atari) 0 = /dev/ada First ACSI disk whole disk 16 = /dev/adb Second ACSI disk whole disk 32 = /dev/adc Third ACSI disk whole disk ... 240 = /dev/adp 16th ACSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15, like SCSI. 29 char Universal frame buffer 0 = /dev/fb0 First frame buffer 1 = /dev/fb1 Second frame buffer ... 
31 = /dev/fb31 32nd frame buffer 29 block Aztech/Orchid/Okano/Wearnes CD-ROM 0 = /dev/aztcd Aztech CD-ROM 30 char iBCS-2 compatibility devices 0 = /dev/socksys Socket access 1 = /dev/spx SVR3 local X interface 32 = /dev/inet/ip Network access 33 = /dev/inet/icmp 34 = /dev/inet/ggp 35 = /dev/inet/ipip 36 = /dev/inet/tcp 37 = /dev/inet/egp 38 = /dev/inet/pup 39 = /dev/inet/udp 40 = /dev/inet/idp 41 = /dev/inet/rawip Additionally, iBCS-2 requires the following links: /dev/ip -> /dev/inet/ip /dev/icmp -> /dev/inet/icmp /dev/ggp -> /dev/inet/ggp /dev/ipip -> /dev/inet/ipip /dev/tcp -> /dev/inet/tcp /dev/egp -> /dev/inet/egp /dev/pup -> /dev/inet/pup /dev/udp -> /dev/inet/udp /dev/idp -> /dev/inet/idp /dev/rawip -> /dev/inet/rawip /dev/inet/arp -> /dev/inet/udp /dev/inet/rip -> /dev/inet/udp /dev/nfsd -> /dev/socksys /dev/X0R -> /dev/null (? apparently not required ?) 30 block Philips LMS CM-205 CD-ROM 0 = /dev/cm205cd Philips LMS CM-205 CD-ROM /dev/lmscd is an older name for this device. This driver does not work with the CM-205MS CD-ROM. 31 char MPU-401 MIDI 0 = /dev/mpu401data MPU-401 data port 1 = /dev/mpu401stat MPU-401 status port 31 block ROM/flash memory card 0 = /dev/rom0 First ROM card (rw) ... 7 = /dev/rom7 Eighth ROM card (rw) 8 = /dev/rrom0 First ROM card (ro) ... 15 = /dev/rrom7 Eighth ROM card (ro) 16 = /dev/flash0 First flash memory card (rw) ... 23 = /dev/flash7 Eighth flash memory card (rw) 24 = /dev/rflash0 First flash memory card (ro) ... 31 = /dev/rflash7 Eighth flash memory card (ro) The read-write (rw) devices support back-caching written data in RAM, as well as writing to flash RAM devices. The read-only devices (ro) support reading only. 32 char Specialix serial card 0 = /dev/ttyX0 First Specialix port 1 = /dev/ttyX1 Second Specialix port ... 
32 block Philips LMS CM-206 CD-ROM 0 = /dev/cm206cd Philips LMS CM-206 CD-ROM 33 char Specialix serial card - alternate devices 0 = /dev/cux0 Callout device for ttyX0 1 = /dev/cux1 Callout device for ttyX1 ... 33 block Third IDE hard disk/CD-ROM interface 0 = /dev/hde Master: whole disk (or CD-ROM) 64 = /dev/hdf Slave: whole disk (or CD-ROM) Partitions are handled the same way as for the first interface (see major number 3). 34 char Z8530 HDLC driver 0 = /dev/scc0 First Z8530, first port 1 = /dev/scc1 First Z8530, second port 2 = /dev/scc2 Second Z8530, first port 3 = /dev/scc3 Second Z8530, second port ... In a previous version these devices were named /dev/sc1 for /dev/scc0, /dev/sc2 for /dev/scc1, and so on. 34 block Fourth IDE hard disk/CD-ROM interface 0 = /dev/hdg Master: whole disk (or CD-ROM) 64 = /dev/hdh Slave: whole disk (or CD-ROM) Partitions are handled the same way as for the first interface (see major number 3). 35 char tclmidi MIDI driver 0 = /dev/midi0 First MIDI port, kernel timed 1 = /dev/midi1 Second MIDI port, kernel timed 2 = /dev/midi2 Third MIDI port, kernel timed 3 = /dev/midi3 Fourth MIDI port, kernel timed 64 = /dev/rmidi0 First MIDI port, untimed 65 = /dev/rmidi1 Second MIDI port, untimed 66 = /dev/rmidi2 Third MIDI port, untimed 67 = /dev/rmidi3 Fourth MIDI port, untimed 128 = /dev/smpte0 First MIDI port, SMPTE timed 129 = /dev/smpte1 Second MIDI port, SMPTE timed 130 = /dev/smpte2 Third MIDI port, SMPTE timed 131 = /dev/smpte3 Fourth MIDI port, SMPTE timed 35 block Slow memory ramdisk 0 = /dev/slram Slow memory ramdisk 36 char Netlink support 0 = /dev/route Routing, device updates, kernel to user 1 = /dev/skip enSKIP security cache control 3 = /dev/fwmonitor Firewall packet copies 16 = /dev/tap0 First Ethertap device ... 31 = /dev/tap15 16th Ethertap device 36 block MCA ESDI hard disk 0 = /dev/eda First ESDI disk whole disk 64 = /dev/edb Second ESDI disk whole disk ... 
Partitions are handled in the same way as IDE disks (see major number 3). 37 char IDE tape 0 = /dev/ht0 First IDE tape 1 = /dev/ht1 Second IDE tape ... 128 = /dev/nht0 First IDE tape, no rewind-on-close 129 = /dev/nht1 Second IDE tape, no rewind-on-close ... Currently, only one IDE tape drive is supported. 37 block Zorro II ramdisk 0 = /dev/z2ram Zorro II ramdisk 38 char Myricom PCI Myrinet board 0 = /dev/mlanai0 First Myrinet board 1 = /dev/mlanai1 Second Myrinet board ... This device is used for status query, board control and "user level packet I/O." This board is also accessible as a standard networking "eth" device. 38 block OBSOLETE (was Linux/AP+) 39 char ML-16P experimental I/O board 0 = /dev/ml16pa-a0 First card, first analog channel 1 = /dev/ml16pa-a1 First card, second analog channel ... 15 = /dev/ml16pa-a15 First card, 16th analog channel 16 = /dev/ml16pa-d First card, digital lines 17 = /dev/ml16pa-c0 First card, first counter/timer 18 = /dev/ml16pa-c1 First card, second counter/timer 19 = /dev/ml16pa-c2 First card, third counter/timer 32 = /dev/ml16pb-a0 Second card, first analog channel 33 = /dev/ml16pb-a1 Second card, second analog channel ... 47 = /dev/ml16pb-a15 Second card, 16th analog channel 48 = /dev/ml16pb-d Second card, digital lines 49 = /dev/ml16pb-c0 Second card, first counter/timer 50 = /dev/ml16pb-c1 Second card, second counter/timer 51 = /dev/ml16pb-c2 Second card, third counter/timer ... 39 block 40 char 40 block 41 char Yet Another Micro Monitor 0 = /dev/yamm Yet Another Micro Monitor 41 block 42 char Demo/sample use 42 block Demo/sample use This number is intended for use in sample code, as well as a general "example" device number. It should never be used for a device driver that is being distributed; either obtain an official number or use the local/experimental range. The sudden addition or removal of a driver with this number should not cause ill effects to the system (bugs excepted.) 
IN PARTICULAR, ANY DISTRIBUTION WHICH CONTAINS A DEVICE DRIVER USING MAJOR NUMBER 42 IS NONCOMPLIANT. 43 char isdn4linux virtual modem 0 = /dev/ttyI0 First virtual modem ... 63 = /dev/ttyI63 64th virtual modem 43 block Network block devices 0 = /dev/nb0 First network block device 1 = /dev/nb1 Second network block device ... A network block device is somewhat similar to a loopback device: if you read from it, it sends a packet across the network asking the server for data. If you write to it, it sends a packet telling the server to write. It can be used for mounting filesystems over the net, swapping over the net, implementing a block device in userland, etc. 44 char isdn4linux virtual modem - alternate devices 0 = /dev/cui0 Callout device for ttyI0 ... 63 = /dev/cui63 Callout device for ttyI63 44 block Flash Translation Layer (FTL) filesystems 0 = /dev/ftla FTL on first Memory Technology Device 16 = /dev/ftlb FTL on second Memory Technology Device 32 = /dev/ftlc FTL on third Memory Technology Device ... 240 = /dev/ftlp FTL on 16th Memory Technology Device Partitions are handled in the same way as for IDE disks (see major number 3) except that the partition limit is 15 rather than 63 per disk (same as SCSI.) 45 char isdn4linux ISDN BRI driver 0 = /dev/isdn0 First virtual B channel raw data ... 63 = /dev/isdn63 64th virtual B channel raw data 64 = /dev/isdnctrl0 First channel control/debug ... 127 = /dev/isdnctrl63 64th channel control/debug 128 = /dev/ippp0 First SyncPPP device ... 191 = /dev/ippp63 64th SyncPPP device 255 = /dev/isdninfo ISDN monitor interface 45 block Parallel port IDE disk devices 0 = /dev/pda First parallel port IDE disk 16 = /dev/pdb Second parallel port IDE disk 32 = /dev/pdc Third parallel port IDE disk 48 = /dev/pdd Fourth parallel port IDE disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the partition limit is 15 rather than 63 per disk.
46 char Comtrol Rocketport serial card 0 = /dev/ttyR0 First Rocketport port 1 = /dev/ttyR1 Second Rocketport port ... 46 block Parallel port ATAPI CD-ROM devices 0 = /dev/pcd0 First parallel port ATAPI CD-ROM 1 = /dev/pcd1 Second parallel port ATAPI CD-ROM 2 = /dev/pcd2 Third parallel port ATAPI CD-ROM 3 = /dev/pcd3 Fourth parallel port ATAPI CD-ROM 47 char Comtrol Rocketport serial card - alternate devices 0 = /dev/cur0 Callout device for ttyR0 1 = /dev/cur1 Callout device for ttyR1 ... 47 block Parallel port ATAPI disk devices 0 = /dev/pf0 First parallel port ATAPI disk 1 = /dev/pf1 Second parallel port ATAPI disk 2 = /dev/pf2 Third parallel port ATAPI disk 3 = /dev/pf3 Fourth parallel port ATAPI disk This driver is intended for floppy disks and similar devices and hence does not support partitioning. 48 char SDL RISCom serial card 0 = /dev/ttyL0 First RISCom port 1 = /dev/ttyL1 Second RISCom port ... 48 block Mylex DAC960 PCI RAID controller; first controller 0 = /dev/rd/c0d0 First disk, whole disk 8 = /dev/rd/c0d1 Second disk, whole disk ... 248 = /dev/rd/c0d31 32nd disk, whole disk For partitions add: 0 = /dev/rd/c?d? Whole disk 1 = /dev/rd/c?d?p1 First partition ... 7 = /dev/rd/c?d?p7 Seventh partition 49 char SDL RISCom serial card - alternate devices 0 = /dev/cul0 Callout device for ttyL0 1 = /dev/cul1 Callout device for ttyL1 ... 49 block Mylex DAC960 PCI RAID controller; second controller 0 = /dev/rd/c1d0 First disk, whole disk 8 = /dev/rd/c1d1 Second disk, whole disk ... 248 = /dev/rd/c1d31 32nd disk, whole disk Partitions are handled as for major 48. 50 char Reserved for GLINT 50 block Mylex DAC960 PCI RAID controller; third controller 0 = /dev/rd/c2d0 First disk, whole disk 8 = /dev/rd/c2d1 Second disk, whole disk ... 248 = /dev/rd/c2d31 32nd disk, whole disk 51 char Baycom radio modem OR Radio Tech BIM-XXX-RS232 radio modem 0 = /dev/bc0 First Baycom radio modem 1 = /dev/bc1 Second Baycom radio modem ... 
51 block Mylex DAC960 PCI RAID controller; fourth controller 0 = /dev/rd/c3d0 First disk, whole disk 8 = /dev/rd/c3d1 Second disk, whole disk ... 248 = /dev/rd/c3d31 32nd disk, whole disk Partitions are handled as for major 48. 52 char Spellcaster DataComm/BRI ISDN card 0 = /dev/dcbri0 First DataComm card 1 = /dev/dcbri1 Second DataComm card 2 = /dev/dcbri2 Third DataComm card 3 = /dev/dcbri3 Fourth DataComm card 52 block Mylex DAC960 PCI RAID controller; fifth controller 0 = /dev/rd/c4d0 First disk, whole disk 8 = /dev/rd/c4d1 Second disk, whole disk ... 248 = /dev/rd/c4d31 32nd disk, whole disk Partitions are handled as for major 48. 53 char BDM interface for remote debugging MC683xx microcontrollers 0 = /dev/pd_bdm0 PD BDM interface on lp0 1 = /dev/pd_bdm1 PD BDM interface on lp1 2 = /dev/pd_bdm2 PD BDM interface on lp2 4 = /dev/icd_bdm0 ICD BDM interface on lp0 5 = /dev/icd_bdm1 ICD BDM interface on lp1 6 = /dev/icd_bdm2 ICD BDM interface on lp2 This device is used for the interfacing to the MC683xx microcontrollers via Background Debug Mode by use of a Parallel Port interface. PD is the Motorola Public Domain Interface and ICD is the commercial interface by P&E. 53 block Mylex DAC960 PCI RAID controller; sixth controller 0 = /dev/rd/c5d0 First disk, whole disk 8 = /dev/rd/c5d1 Second disk, whole disk ... 248 = /dev/rd/c5d31 32nd disk, whole disk Partitions are handled as for major 48. 54 char Electrocardiognosis Holter serial card 0 = /dev/holter0 First Holter port 1 = /dev/holter1 Second Holter port 2 = /dev/holter2 Third Holter port A custom serial card used by Electrocardiognosis SRL to transfer data from Holter 24-hour heart monitoring equipment. 54 block Mylex DAC960 PCI RAID controller; seventh controller 0 = /dev/rd/c6d0 First disk, whole disk 8 = /dev/rd/c6d1 Second disk, whole disk ... 248 = /dev/rd/c6d31 32nd disk, whole disk Partitions are handled as for major 48. 
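For the Mylex DAC960 entries, which start at block major 48 and use one major per controller, each disk takes 8 minors: the whole disk plus partitions 1-7. A short Python sketch of that mapping follows. The function name dac960_node is hypothetical, and the sketch assumes the eight controllers listed in this document occupy consecutive majors.

```python
# Illustrative sketch only: the helper name is hypothetical.
DAC960_FIRST_MAJOR = 48  # first controller, per the table above
MINORS_PER_DISK = 8      # whole disk + partitions 1..7


def dac960_node(controller, disk, partition=0):
    """Return (major, minor, name) for a DAC960 logical disk or partition."""
    if not 0 <= controller <= 7:
        raise ValueError("controller must be in 0..7")
    if not 0 <= disk <= 31:
        raise ValueError("disk must be in 0..31")
    if not 0 <= partition < MINORS_PER_DISK:
        raise ValueError("partition must be in 0..7 (0 = whole disk)")
    major = DAC960_FIRST_MAJOR + controller
    minor = MINORS_PER_DISK * disk + partition
    name = f"/dev/rd/c{controller}d{disk}"
    if partition:
        name += f"p{partition}"
    return major, minor, name


if __name__ == "__main__":
    print(dac960_node(0, 0))
    print(dac960_node(6, 31))
    print(dac960_node(0, 1, 7))
```

For example, /dev/rd/c6d31 works out to major 54, minor 248, matching the last entry listed for the seventh controller.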
55 char DSP56001 digital signal processor 0 = /dev/dsp56k First DSP56001 55 block Mylex DAC960 PCI RAID controller; eighth controller 0 = /dev/rd/c7d0 First disk, whole disk 8 = /dev/rd/c7d1 Second disk, whole disk ... 248 = /dev/rd/c7d31 32nd disk, whole disk Partitions are handled as for major 48. 56 char Apple Desktop Bus 0 = /dev/adb ADB bus control Additional devices will be added to this number, all starting with /dev/adb. 56 block Fifth IDE hard disk/CD-ROM interface 0 = /dev/hdi Master: whole disk (or CD-ROM) 64 = /dev/hdj Slave: whole disk (or CD-ROM) Partitions are handled the same way as for the first interface (see major number 3). 57 char Hayes ESP serial card 0 = /dev/ttyP0 First ESP port 1 = /dev/ttyP1 Second ESP port ... 57 block Sixth IDE hard disk/CD-ROM interface 0 = /dev/hdk Master: whole disk (or CD-ROM) 64 = /dev/hdl Slave: whole disk (or CD-ROM) Partitions are handled the same way as for the first interface (see major number 3). 58 char Hayes ESP serial card - alternate devices 0 = /dev/cup0 Callout device for ttyP0 1 = /dev/cup1 Callout device for ttyP1 ... 58 block Reserved for logical volume manager 59 char sf firewall package 0 = /dev/firewall Communication with sf kernel module 59 block Generic PDA filesystem device 0 = /dev/pda0 First PDA device 1 = /dev/pda1 Second PDA device ... The pda devices are used to mount filesystems on remote pda's (basically slow handheld machines with proprietary OS's and limited memory and storage running small fs translation drivers) through serial / IRDA / parallel links. NAMING CONFLICT -- PROPOSED REVISED NAME /dev/rpda0 etc 60-63 char LOCAL/EXPERIMENTAL USE 60-63 block LOCAL/EXPERIMENTAL USE Allocated for local/experimental use. For devices not assigned official numbers, these ranges should be used in order to avoid conflicting with future assignments. 
64 char ENskip kernel encryption package 0 = /dev/enskip Communication with ENskip kernel module 64 block Scramdisk/DriveCrypt encrypted devices 0 = /dev/scramdisk/master Master node for ioctls 1 = /dev/scramdisk/1 First encrypted device 2 = /dev/scramdisk/2 Second encrypted device ... 255 = /dev/scramdisk/255 255th encrypted device The filename of the encrypted container and the passwords are sent via ioctls (using the sdmount tool) to the master node which then activates them via one of the /dev/scramdisk/x nodes for loop mounting (all handled through the sdmount tool). Requested by: andy@scramdisklinux.org 65 char Sundance "plink" Transputer boards (obsolete, unused) 0 = /dev/plink0 First plink device 1 = /dev/plink1 Second plink device 2 = /dev/plink2 Third plink device 3 = /dev/plink3 Fourth plink device 64 = /dev/rplink0 First plink device, raw 65 = /dev/rplink1 Second plink device, raw 66 = /dev/rplink2 Third plink device, raw 67 = /dev/rplink3 Fourth plink device, raw 128 = /dev/plink0d First plink device, debug 129 = /dev/plink1d Second plink device, debug 130 = /dev/plink2d Third plink device, debug 131 = /dev/plink3d Fourth plink device, debug 192 = /dev/rplink0d First plink device, raw, debug 193 = /dev/rplink1d Second plink device, raw, debug 194 = /dev/rplink2d Third plink device, raw, debug 195 = /dev/rplink3d Fourth plink device, raw, debug This is a commercial driver; contact James Howes for information. 65 block SCSI disk devices (16-31) 0 = /dev/sdq 17th SCSI disk whole disk 16 = /dev/sdr 18th SCSI disk whole disk 32 = /dev/sds 19th SCSI disk whole disk ... 240 = /dev/sdaf 32nd SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 66 char YARC PowerPC PCI coprocessor card 0 = /dev/yppcpci0 First YARC card 1 = /dev/yppcpci1 Second YARC card ... 
66 block SCSI disk devices (32-47) 0 = /dev/sdag 33rd SCSI disk whole disk 16 = /dev/sdah 34th SCSI disk whole disk 32 = /dev/sdai 35th SCSI disk whole disk ... 240 = /dev/sdav 48th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 67 char Coda network file system 0 = /dev/cfs0 Coda cache manager See http://www.coda.cs.cmu.edu for information about Coda. 67 block SCSI disk devices (48-63) 0 = /dev/sdaw 49th SCSI disk whole disk 16 = /dev/sdax 50th SCSI disk whole disk 32 = /dev/sday 51st SCSI disk whole disk ... 240 = /dev/sdbl 64th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 68 char CAPI 2.0 interface 0 = /dev/capi20 Control device 1 = /dev/capi20.00 First CAPI 2.0 application 2 = /dev/capi20.01 Second CAPI 2.0 application ... 20 = /dev/capi20.19 20th CAPI 2.0 application ISDN CAPI 2.0 driver for use with CAPI 2.0 applications; currently supports the AVM B1 card. 68 block SCSI disk devices (64-79) 0 = /dev/sdbm 65th SCSI disk whole disk 16 = /dev/sdbn 66th SCSI disk whole disk 32 = /dev/sdbo 67th SCSI disk whole disk ... 240 = /dev/sdcb 80th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 69 char MA16 numeric accelerator card 0 = /dev/ma16 Board memory access 69 block SCSI disk devices (80-95) 0 = /dev/sdcc 81st SCSI disk whole disk 16 = /dev/sdcd 82nd SCSI disk whole disk 32 = /dev/sdce 83rd SCSI disk whole disk ... 240 = /dev/sdcr 96th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15.
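The SCSI disk blocks above share one scheme: 16 disks per major, 16 minors per disk, and names that run sda..sdz, sdaa, sdab and so on. The following sketch derives major, minor and node name from a zero-based disk index. It assumes the first 16 disks live at major 8 (listed earlier in this file) and that disks 128-255 use majors 128-135 as listed further on; the function name is hypothetical.

```python
def scsi_disk_node(index: int, partition: int = 0):
    """Return (major, minor, path) for the index-th SCSI disk, counting from 0 (sketch)."""
    if not 0 <= index <= 255:
        raise ValueError("index must be 0..255")
    if not 0 <= partition <= 15:
        raise ValueError("partition must be 0..15 (0 = whole disk)")
    block = index // 16  # which group of 16 disks
    if block == 0:
        major = 8                  # assumption: disks 0-15
    elif block <= 7:
        major = 64 + block         # majors 65..71 for disks 16-127
    else:
        major = 128 + (block - 8)  # majors 128..135 for disks 128-255
    minor = 16 * (index % 16) + partition
    # Bijective base-26 suffix: a..z, aa..az, ba.. and so on.
    n = index + 1
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("a") + rem) + letters
    path = "/dev/sd" + letters + (str(partition) if partition else "")
    return major, minor, path


if __name__ == "__main__":
    for i in (16, 32, 95, 255):
        print(i, scsi_disk_node(i))
```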
70 char SpellCaster Protocol Services Interface 0 = /dev/apscfg Configuration interface 1 = /dev/apsauth Authentication interface 2 = /dev/apslog Logging interface 3 = /dev/apsdbg Debugging interface 64 = /dev/apsisdn ISDN command interface 65 = /dev/apsasync Async command interface 128 = /dev/apsmon Monitor interface 70 block SCSI disk devices (96-111) 0 = /dev/sdcs 97th SCSI disk whole disk 16 = /dev/sdct 98th SCSI disk whole disk 32 = /dev/sdcu 99th SCSI disk whole disk ... 240 = /dev/sddh 112th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 71 char Computone IntelliPort II serial card 0 = /dev/ttyF0 IntelliPort II board 0, port 0 1 = /dev/ttyF1 IntelliPort II board 0, port 1 ... 63 = /dev/ttyF63 IntelliPort II board 0, port 63 64 = /dev/ttyF64 IntelliPort II board 1, port 0 65 = /dev/ttyF65 IntelliPort II board 1, port 1 ... 127 = /dev/ttyF127 IntelliPort II board 1, port 63 128 = /dev/ttyF128 IntelliPort II board 2, port 0 129 = /dev/ttyF129 IntelliPort II board 2, port 1 ... 191 = /dev/ttyF191 IntelliPort II board 2, port 63 192 = /dev/ttyF192 IntelliPort II board 3, port 0 193 = /dev/ttyF193 IntelliPort II board 3, port 1 ... 255 = /dev/ttyF255 IntelliPort II board 3, port 63 71 block SCSI disk devices (112-127) 0 = /dev/sddi 113th SCSI disk whole disk 16 = /dev/sddj 114th SCSI disk whole disk 32 = /dev/sddk 115th SCSI disk whole disk ... 240 = /dev/sddx 128th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 72 char Computone IntelliPort II serial card - alternate devices 0 = /dev/cuf0 Callout device for ttyF0 1 = /dev/cuf1 Callout device for ttyF1 ... 63 = /dev/cuf63 Callout device for ttyF63 64 = /dev/cuf64 Callout device for ttyF64 65 = /dev/cuf65 Callout device for ttyF65 ...
127 = /dev/cuf127 Callout device for ttyF127 128 = /dev/cuf128 Callout device for ttyF128 129 = /dev/cuf129 Callout device for ttyF129 ... 191 = /dev/cuf191 Callout device for ttyF191 192 = /dev/cuf192 Callout device for ttyF192 193 = /dev/cuf193 Callout device for ttyF193 ... 255 = /dev/cuf255 Callout device for ttyF255 72 block Compaq Intelligent Drive Array, first controller 0 = /dev/ida/c0d0 First logical drive whole disk 16 = /dev/ida/c0d1 Second logical drive whole disk ... 240 = /dev/ida/c0d15 16th logical drive whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 73 char Computone IntelliPort II serial card - control devices 0 = /dev/ip2ipl0 Loadware device for board 0 1 = /dev/ip2stat0 Status device for board 0 4 = /dev/ip2ipl1 Loadware device for board 1 5 = /dev/ip2stat1 Status device for board 1 8 = /dev/ip2ipl2 Loadware device for board 2 9 = /dev/ip2stat2 Status device for board 2 12 = /dev/ip2ipl3 Loadware device for board 3 13 = /dev/ip2stat3 Status device for board 3 73 block Compaq Intelligent Drive Array, second controller 0 = /dev/ida/c1d0 First logical drive whole disk 16 = /dev/ida/c1d1 Second logical drive whole disk ... 240 = /dev/ida/c1d15 16th logical drive whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 74 char SCI bridge 0 = /dev/SCI/0 SCI device 0 1 = /dev/SCI/1 SCI device 1 ... Currently for Dolphin Interconnect Solutions' PCI-SCI bridge. 74 block Compaq Intelligent Drive Array, third controller 0 = /dev/ida/c2d0 First logical drive whole disk 16 = /dev/ida/c2d1 Second logical drive whole disk ... 240 = /dev/ida/c2d15 16th logical drive whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 
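The Computone IntelliPort II entries (char majors 71, 72 and 73) use 64 ports per board for the tty and callout nodes, and a stride of 4 minors per board for the control nodes. A small sketch of that arithmetic follows; the function name and the dictionary layout are illustrative assumptions, not part of any driver.

```python
PORTS_PER_BOARD = 64   # from the ttyF/cuf listings: 64 minors per board
CONTROL_STRIDE = 4     # from the ip2ipl/ip2stat listings: 4 minors per board


def intelliport_nodes(board: int, port: int):
    """Return (major, minor, path) tuples for one IntelliPort II port (sketch)."""
    if not 0 <= board <= 3:
        raise ValueError("board must be 0..3")
    if not 0 <= port < PORTS_PER_BOARD:
        raise ValueError("port must be 0..63")
    n = PORTS_PER_BOARD * board + port
    return {
        "tty": (71, n, f"/dev/ttyF{n}"),
        "callout": (72, n, f"/dev/cuf{n}"),
        "loadware": (73, CONTROL_STRIDE * board, f"/dev/ip2ipl{board}"),
        "status": (73, CONTROL_STRIDE * board + 1, f"/dev/ip2stat{board}"),
    }


if __name__ == "__main__":
    for kind, node in intelliport_nodes(2, 5).items():
        print(kind, node)
```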
75 char Specialix IO8+ serial card 0 = /dev/ttyW0 First IO8+ port, first card 1 = /dev/ttyW1 Second IO8+ port, first card ... 8 = /dev/ttyW8 First IO8+ port, second card ... 75 block Compaq Intelligent Drive Array, fourth controller 0 = /dev/ida/c3d0 First logical drive whole disk 16 = /dev/ida/c3d1 Second logical drive whole disk ... 240 = /dev/ida/c3d15 16th logical drive whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 76 char Specialix IO8+ serial card - alternate devices 0 = /dev/cuw0 Callout device for ttyW0 1 = /dev/cuw1 Callout device for ttyW1 ... 8 = /dev/cuw8 Callout device for ttyW8 ... 76 block Compaq Intelligent Drive Array, fifth controller 0 = /dev/ida/c4d0 First logical drive whole disk 16 = /dev/ida/c4d1 Second logical drive whole disk ... 240 = /dev/ida/c4d15 16th logical drive whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 77 char ComScire Quantum Noise Generator 0 = /dev/qng ComScire Quantum Noise Generator 77 block Compaq Intelligent Drive Array, sixth controller 0 = /dev/ida/c5d0 First logical drive whole disk 16 = /dev/ida/c5d1 Second logical drive whole disk ... 240 = /dev/ida/c5d15 16th logical drive whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 78 char PAM Software's multimodem boards 0 = /dev/ttyM0 First PAM modem 1 = /dev/ttyM1 Second PAM modem ... 78 block Compaq Intelligent Drive Array, seventh controller 0 = /dev/ida/c6d0 First logical drive whole disk 16 = /dev/ida/c6d1 Second logical drive whole disk ... 240 = /dev/ida/c6d15 16th logical drive whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 
79 char PAM Software's multimodem boards - alternate devices 0 = /dev/cum0 Callout device for ttyM0 1 = /dev/cum1 Callout device for ttyM1 ... 79 block Compaq Intelligent Drive Array, eighth controller 0 = /dev/ida/c7d0 First logical drive whole disk 16 = /dev/ida/c7d1 Second logical drive whole disk ... 240 = /dev/ida/c7d15 16th logical drive whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 80 char Photometrics AT200 CCD camera 0 = /dev/at200 Photometrics AT200 CCD camera 80 block I2O hard disk 0 = /dev/i2o/hda First I2O hard disk, whole disk 16 = /dev/i2o/hdb Second I2O hard disk, whole disk ... 240 = /dev/i2o/hdp 16th I2O hard disk, whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 81 char video4linux 0 = /dev/video0 Video capture/overlay device ... 63 = /dev/video63 Video capture/overlay device 64 = /dev/radio0 Radio device ... 127 = /dev/radio63 Radio device 224 = /dev/vbi0 Vertical blank interrupt ... 255 = /dev/vbi31 Vertical blank interrupt 81 block I2O hard disk 0 = /dev/i2o/hdq 17th I2O hard disk, whole disk 16 = /dev/i2o/hdr 18th I2O hard disk, whole disk ... 240 = /dev/i2o/hdaf 32nd I2O hard disk, whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 82 char WiNRADiO communications receiver card 0 = /dev/winradio0 First WiNRADiO card 1 = /dev/winradio1 Second WiNRADiO card ... The driver and documentation may be obtained from http://www.winradio.com/ 82 block I2O hard disk 0 = /dev/i2o/hdag 33rd I2O hard disk, whole disk 16 = /dev/i2o/hdah 34th I2O hard disk, whole disk ... 240 = /dev/i2o/hdav 48th I2O hard disk, whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15.
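The video4linux entry (char major 81) splits its minor space into fixed ranges. As a minimal sketch, the helper below maps a minor number back to the node name given in that entry; minors outside the listed ranges are reported as unassigned, which is an assumption about how a caller would want to treat them. The function name is hypothetical.

```python
def v4l_node_name(minor: int):
    """Map a char-major-81 minor to its listed node name, or None if not listed (sketch)."""
    if 0 <= minor <= 63:
        return f"/dev/video{minor}"        # video capture/overlay devices
    if 64 <= minor <= 127:
        return f"/dev/radio{minor - 64}"   # radio devices
    if 224 <= minor <= 255:
        return f"/dev/vbi{minor - 224}"    # vertical blank interrupt devices
    return None  # minors 128-223 are not listed in the entry


if __name__ == "__main__":
    for m in (0, 64, 127, 200, 224):
        print(m, v4l_node_name(m))
```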
83 char Matrox mga_vid video driver 0 = /dev/mga_vid0 1st video card 1 = /dev/mga_vid1 2nd video card 2 = /dev/mga_vid2 3rd video card ... 15 = /dev/mga_vid15 16th video card 83 block I2O hard disk 0 = /dev/i2o/hdaw 49th I2O hard disk, whole disk 16 = /dev/i2o/hdax 50th I2O hard disk, whole disk ... 240 = /dev/i2o/hdbl 64th I2O hard disk, whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 84 char Ikon 1011[57] Versatec Greensheet Interface 0 = /dev/ihcp0 First Greensheet port 1 = /dev/ihcp1 Second Greensheet port 84 block I2O hard disk 0 = /dev/i2o/hdbm 65th I2O hard disk, whole disk 16 = /dev/i2o/hdbn 66th I2O hard disk, whole disk ... 240 = /dev/i2o/hdcb 80th I2O hard disk, whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 85 char Linux/SGI shared memory input queue 0 = /dev/shmiq Master shared input queue 1 = /dev/qcntl0 First device pushed 2 = /dev/qcntl1 Second device pushed ... 85 block I2O hard disk 0 = /dev/i2o/hdcc 81st I2O hard disk, whole disk 16 = /dev/i2o/hdcd 82nd I2O hard disk, whole disk ... 240 = /dev/i2o/hdcr 96th I2O hard disk, whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 86 char SCSI media changer 0 = /dev/sch0 First SCSI media changer 1 = /dev/sch1 Second SCSI media changer ... 86 block I2O hard disk 0 = /dev/i2o/hdcs 97th I2O hard disk, whole disk 16 = /dev/i2o/hdct 98th I2O hard disk, whole disk ... 240 = /dev/i2o/hddh 112th I2O hard disk, whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 87 char Sony Control-A1 stereo control bus 0 = /dev/controla0 First device on chain 1 = /dev/controla1 Second device on chain ... 
87 block I2O hard disk 0 = /dev/i2o/hddi 113th I2O hard disk, whole disk 16 = /dev/i2o/hddj 114th I2O hard disk, whole disk ... 240 = /dev/i2o/hddx 128th I2O hard disk, whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 88 char COMX synchronous serial card 0 = /dev/comx0 COMX channel 0 1 = /dev/comx1 COMX channel 1 ... 88 block Seventh IDE hard disk/CD-ROM interface 0 = /dev/hdm Master: whole disk (or CD-ROM) 64 = /dev/hdn Slave: whole disk (or CD-ROM) Partitions are handled the same way as for the first interface (see major number 3). 89 char I2C bus interface 0 = /dev/i2c-0 First I2C adapter 1 = /dev/i2c-1 Second I2C adapter ... 89 block Eighth IDE hard disk/CD-ROM interface 0 = /dev/hdo Master: whole disk (or CD-ROM) 64 = /dev/hdp Slave: whole disk (or CD-ROM) Partitions are handled the same way as for the first interface (see major number 3). 90 char Memory Technology Device (RAM, ROM, Flash) 0 = /dev/mtd0 First MTD (rw) 1 = /dev/mtdr0 First MTD (ro) ... 30 = /dev/mtd15 16th MTD (rw) 31 = /dev/mtdr15 16th MTD (ro) 90 block Ninth IDE hard disk/CD-ROM interface 0 = /dev/hdq Master: whole disk (or CD-ROM) 64 = /dev/hdr Slave: whole disk (or CD-ROM) Partitions are handled the same way as for the first interface (see major number 3). 91 char CAN-Bus devices 0 = /dev/can0 First CAN-Bus controller 1 = /dev/can1 Second CAN-Bus controller ... 91 block Tenth IDE hard disk/CD-ROM interface 0 = /dev/hds Master: whole disk (or CD-ROM) 64 = /dev/hdt Slave: whole disk (or CD-ROM) Partitions are handled the same way as for the first interface (see major number 3). 92 char Reserved for ith Kommunikationstechnik MIC ISDN card 92 block PPDD encrypted disk driver 0 = /dev/ppdd0 First encrypted disk 1 = /dev/ppdd1 Second encrypted disk ... Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15.
93 char 93 block NAND Flash Translation Layer filesystem 0 = /dev/nftla First NFTL layer 16 = /dev/nftlb Second NFTL layer ... 240 = /dev/nftlp 16th NFTL layer 94 char 94 block IBM S/390 DASD block storage 0 = /dev/dasda First DASD device, major 1 = /dev/dasda1 First DASD device, block 1 2 = /dev/dasda2 First DASD device, block 2 3 = /dev/dasda3 First DASD device, block 3 4 = /dev/dasdb Second DASD device, major 5 = /dev/dasdb1 Second DASD device, block 1 6 = /dev/dasdb2 Second DASD device, block 2 7 = /dev/dasdb3 Second DASD device, block 3 ... 95 char IP filter 0 = /dev/ipl Filter control device/log file 1 = /dev/ipnat NAT control device/log file 2 = /dev/ipstate State information log file 3 = /dev/ipauth Authentication control device/log file ... 96 char Parallel port ATAPI tape devices 0 = /dev/pt0 First parallel port ATAPI tape 1 = /dev/pt1 Second parallel port ATAPI tape ... 128 = /dev/npt0 First p.p. ATAPI tape, no rewind 129 = /dev/npt1 Second p.p. ATAPI tape, no rewind ... 96 block Inverse NAND Flash Translation Layer 0 = /dev/inftla First INFTL layer 16 = /dev/inftlb Second INFTL layer ... 240 = /dev/inftlp 16th INFTL layer 97 char Parallel port generic ATAPI interface 0 = /dev/pg0 First parallel port ATAPI device 1 = /dev/pg1 Second parallel port ATAPI device 2 = /dev/pg2 Third parallel port ATAPI device 3 = /dev/pg3 Fourth parallel port ATAPI device These devices support the same API as the generic SCSI devices. 98 char Control and Measurement Device (comedi) 0 = /dev/comedi0 First comedi device 1 = /dev/comedi1 Second comedi device ... See http://stm.lbl.gov/comedi. 98 block User-mode virtual block device 0 = /dev/ubda First user-mode block device 16 = /dev/ubdb Second user-mode block device ... Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. This device is used by the user-mode virtual kernel port.
99 char Raw parallel ports 0 = /dev/parport0 First parallel port 1 = /dev/parport1 Second parallel port ... 99 block JavaStation flash disk 0 = /dev/jsfd JavaStation flash disk 100 char Telephony for Linux 0 = /dev/phone0 First telephony device 1 = /dev/phone1 Second telephony device ... 101 char Motorola DSP 56xxx board 0 = /dev/mdspstat Status information 1 = /dev/mdsp1 First DSP board I/O controls ... 16 = /dev/mdsp16 16th DSP board I/O controls 101 block AMI HyperDisk RAID controller 0 = /dev/amiraid/ar0 First array whole disk 16 = /dev/amiraid/ar1 Second array whole disk ... 240 = /dev/amiraid/ar15 16th array whole disk For each device, partitions are added as: 0 = /dev/amiraid/ar? Whole disk 1 = /dev/amiraid/ar?p1 First partition 2 = /dev/amiraid/ar?p2 Second partition ... 15 = /dev/amiraid/ar?p15 15th partition 102 char 102 block Compressed block device 0 = /dev/cbd/a First compressed block device, whole device 16 = /dev/cbd/b Second compressed block device, whole device ... 240 = /dev/cbd/p 16th compressed block device, whole device Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 103 char Arla network file system 0 = /dev/nnpfs0 First NNPFS device 1 = /dev/nnpfs1 Second NNPFS device Arla is a free clone of the Andrew File System, AFS. The NNPFS device gives user mode filesystem implementations a kernel presence for caching and easy mounting. For more information about the project, see http://www.stacken.kth.se/project/arla/ 103 block Audit device 0 = /dev/audit Audit device 104 char Flash BIOS support 104 block Compaq Next Generation Drive Array, first controller 0 = /dev/cciss/c0d0 First logical drive, whole disk 16 = /dev/cciss/c0d1 Second logical drive, whole disk ... 240 = /dev/cciss/c0d15 16th logical drive, whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15.
105 char Comtrol VS-1000 serial controller 0 = /dev/ttyV0 First VS-1000 port 1 = /dev/ttyV1 Second VS-1000 port ... 105 block Compaq Next Generation Drive Array, second controller 0 = /dev/cciss/c1d0 First logical drive, whole disk 16 = /dev/cciss/c1d1 Second logical drive, whole disk ... 240 = /dev/cciss/c1d15 16th logical drive, whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 106 char Comtrol VS-1000 serial controller - alternate devices 0 = /dev/cuv0 First VS-1000 port 1 = /dev/cuv1 Second VS-1000 port ... 106 block Compaq Next Generation Drive Array, third controller 0 = /dev/cciss/c2d0 First logical drive, whole disk 16 = /dev/cciss/c2d1 Second logical drive, whole disk ... 240 = /dev/cciss/c2d15 16th logical drive, whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 107 char 3Dfx Voodoo Graphics device 0 = /dev/3dfx Primary 3Dfx graphics device 107 block Compaq Next Generation Drive Array, fourth controller 0 = /dev/cciss/c3d0 First logical drive, whole disk 16 = /dev/cciss/c3d1 Second logical drive, whole disk ... 240 = /dev/cciss/c3d15 16th logical drive, whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 108 char Device independent PPP interface 0 = /dev/ppp Device independent PPP interface 108 block Compaq Next Generation Drive Array, fifth controller 0 = /dev/cciss/c4d0 First logical drive, whole disk 16 = /dev/cciss/c4d1 Second logical drive, whole disk ... 240 = /dev/cciss/c4d15 16th logical drive, whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 
109 char Reserved for logical volume manager 109 block Compaq Next Generation Drive Array, sixth controller 0 = /dev/cciss/c5d0 First logical drive, whole disk 16 = /dev/cciss/c5d1 Second logical drive, whole disk ... 240 = /dev/cciss/c5d15 16th logical drive, whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 110 char miroMEDIA Surround board 0 = /dev/srnd0 First miroMEDIA Surround board 1 = /dev/srnd1 Second miroMEDIA Surround board ... 110 block Compaq Next Generation Drive Array, seventh controller 0 = /dev/cciss/c6d0 First logical drive, whole disk 16 = /dev/cciss/c6d1 Second logical drive, whole disk ... 240 = /dev/cciss/c6d15 16th logical drive, whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 111 char 111 block Compaq Next Generation Drive Array, eighth controller 0 = /dev/cciss/c7d0 First logical drive, whole disk 16 = /dev/cciss/c7d1 Second logical drive, whole disk ... 240 = /dev/cciss/c7d15 16th logical drive, whole disk Partitions are handled the same way as for Mylex DAC960 (see major number 48) except that the limit on partitions is 15. 112 char ISI serial card 0 = /dev/ttyM0 First ISI port 1 = /dev/ttyM1 Second ISI port ... There is currently a device-naming conflict between these and PAM multimodems (major 78). 112 block IBM iSeries virtual disk 0 = /dev/iseries/vda First virtual disk, whole disk 8 = /dev/iseries/vdb Second virtual disk, whole disk ... 200 = /dev/iseries/vdz 26th virtual disk, whole disk 208 = /dev/iseries/vdaa 27th virtual disk, whole disk ... 248 = /dev/iseries/vdaf 32nd virtual disk, whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 7. 113 char ISI serial card - alternate devices 0 = /dev/cum0 Callout device for ttyM0 1 = /dev/cum1 Callout device for ttyM1 ... 
113 block IBM iSeries virtual CD-ROM 0 = /dev/iseries/vcda First virtual CD-ROM 1 = /dev/iseries/vcdb Second virtual CD-ROM ... 114 char Picture Elements ISE board 0 = /dev/ise0 First ISE board 1 = /dev/ise1 Second ISE board ... 128 = /dev/isex0 Control node for first ISE board 129 = /dev/isex1 Control node for second ISE board ... The ISE board is an embedded computer, optimized for image processing. The /dev/iseN nodes provide general I/O access to the board; the /dev/isexN nodes are command nodes used to control the board. 114 block IDE BIOS powered software RAID interfaces such as the Promise Fastrak 0 = /dev/ataraid/d0 1 = /dev/ataraid/d0p1 2 = /dev/ataraid/d0p2 ... 16 = /dev/ataraid/d1 17 = /dev/ataraid/d1p1 18 = /dev/ataraid/d1p2 ... 255 = /dev/ataraid/d15p15 Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 115 char TI link cable devices (115 was formerly the console driver speaker) 0 = /dev/tipar0 Parallel cable on first parallel port ... 7 = /dev/tipar7 Parallel cable on eighth parallel port 8 = /dev/tiser0 Serial cable on first serial port ... 15 = /dev/tiser7 Serial cable on eighth serial port 16 = /dev/tiusb0 First USB cable ... 47 = /dev/tiusb31 32nd USB cable 115 block NetWare (NWFS) Devices (0-255) The NWFS (NetWare) devices are used to present a collection of NetWare Mirror Groups or NetWare Partitions as a logical storage segment for use in mounting NetWare volumes. A maximum of 256 NetWare volumes can be supported in a single machine. http://cgfa.telepac.pt/ftp2/kernel.org/linux/kernel/people/jmerkey/nwfs/ 0 = /dev/nwfs/v0 First NetWare (NWFS) Logical Volume 1 = /dev/nwfs/v1 Second NetWare (NWFS) Logical Volume 2 = /dev/nwfs/v2 Third NetWare (NWFS) Logical Volume ... 255 = /dev/nwfs/v255 Last NetWare (NWFS) Logical Volume 116 char Advanced Linux Sound Driver (ALSA) 116 block MicroMemory battery backed RAM adapter (NVRAM) Supports 16 boards, 15 partitions each.
Requested by neilb at cse.unsw.edu.au. 0 = /dev/umem/d0 Whole of first board 1 = /dev/umem/d0p1 First partition of first board 2 = /dev/umem/d0p2 Second partition of first board 15 = /dev/umem/d0p15 15th partition of first board 16 = /dev/umem/d1 Whole of second board 17 = /dev/umem/d1p1 First partition of second board ... 255= /dev/umem/d15p15 15th partition of 16th board. 117 char COSA/SRP synchronous serial card 0 = /dev/cosa0c0 1st board, 1st channel 1 = /dev/cosa0c1 1st board, 2nd channel ... 16 = /dev/cosa1c0 2nd board, 1st channel 17 = /dev/cosa1c1 2nd board, 2nd channel ... 117 block Enterprise Volume Management System (EVMS) The EVMS driver uses a layered, plug-in model to provide unparalleled flexibility and extensibility in managing storage. This allows for easy expansion or customization of various levels of volume management. Requested by Mark Peloquin (peloquin at us.ibm.com). Note: EVMS populates and manages all the devnodes in /dev/evms. http://sf.net/projects/evms 0 = /dev/evms/block_device EVMS block device 1 = /dev/evms/legacyname1 First EVMS legacy device 2 = /dev/evms/legacyname2 Second EVMS legacy device ... Both ranges can grow (down or up) until they meet. ... 254 = /dev/evms/EVMSname2 Second EVMS native device 255 = /dev/evms/EVMSname1 First EVMS native device Note: legacyname(s) are derived from the normal legacy device names. For example, /dev/hda5 would become /dev/evms/hda5. 118 char IBM Cryptographic Accelerator 0 = /dev/ica Virtual interface to all IBM Crypto Accelerators 1 = /dev/ica0 IBMCA Device 0 2 = /dev/ica1 IBMCA Device 1 ... 119 char VMware virtual network control 0 = /dev/vnet0 1st virtual network 1 = /dev/vnet1 2nd virtual network ... 120-127 char LOCAL/EXPERIMENTAL USE 120-127 block LOCAL/EXPERIMENTAL USE Allocated for local/experimental use. For devices not assigned official numbers, these ranges should be used in order to avoid conflicting with future assignments. 
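The LOCAL/EXPERIMENTAL USE ranges can be checked mechanically before a private driver picks a major. The sketch below covers only the two ranges stated in this part of the list (60-63 and 120-127); any ranges defined elsewhere in the file are deliberately left out. It uses the standard os.makedev, os.major and os.minor calls to show how a (major, minor) pair is packed into a device number; the helper name is hypothetical.

```python
import os

# Ranges taken from the LOCAL/EXPERIMENTAL USE entries in this part of the list.
LOCAL_MAJOR_RANGES = [(60, 63), (120, 127)]


def is_local_major(major: int) -> bool:
    """True if the major falls in a local/experimental range listed above (sketch)."""
    return any(lo <= major <= hi for lo, hi in LOCAL_MAJOR_RANGES)


if __name__ == "__main__":
    major, minor = 60, 0
    if is_local_major(major):
        dev = os.makedev(major, minor)  # pack major/minor into one device number
        print("device number:", dev, "->", os.major(dev), os.minor(dev))
```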
128-135 char Unix98 PTY masters These devices should not have corresponding device nodes; instead they should be accessed through the /dev/ptmx cloning interface. 128 block SCSI disk devices (128-143) 0 = /dev/sddy 129th SCSI disk whole disk 16 = /dev/sddz 130th SCSI disk whole disk 32 = /dev/sdea 131st SCSI disk whole disk ... 240 = /dev/sden 144th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 129 block SCSI disk devices (144-159) 0 = /dev/sdeo 145th SCSI disk whole disk 16 = /dev/sdep 146th SCSI disk whole disk 32 = /dev/sdeq 147th SCSI disk whole disk ... 240 = /dev/sdfd 160th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 130 char (Misc devices) 130 block SCSI disk devices (160-175) 0 = /dev/sdfe 161st SCSI disk whole disk 16 = /dev/sdff 162nd SCSI disk whole disk 32 = /dev/sdfg 163rd SCSI disk whole disk ... 240 = /dev/sdft 176th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 131 block SCSI disk devices (176-191) 0 = /dev/sdfu 177th SCSI disk whole disk 16 = /dev/sdfv 178th SCSI disk whole disk 32 = /dev/sdfw 179th SCSI disk whole disk ... 240 = /dev/sdgj 192nd SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 132 block SCSI disk devices (192-207) 0 = /dev/sdgk 193rd SCSI disk whole disk 16 = /dev/sdgl 194th SCSI disk whole disk 32 = /dev/sdgm 195th SCSI disk whole disk ... 240 = /dev/sdgz 208th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15.
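The Unix98 PTY master entry (char majors 128-135) says masters have no device nodes of their own and are reached through the /dev/ptmx cloning interface. The sketch below shows one way a program might obtain a master/slave pair and inspect the slave's device numbers. It assumes a Linux system with devpts mounted on /dev/pts, and that os.openpty() goes through /dev/ptmx, as is expected with glibc; the slave majors 136-143 it checks against are the ones listed further on in this file. The function name is hypothetical.

```python
import os


def open_pty_pair():
    """Open a pseudo-terminal pair and report the slave's name and device numbers (sketch)."""
    # Assumption: on Linux this allocates the master via the /dev/ptmx cloning device.
    master_fd, slave_fd = os.openpty()
    slave_name = os.ttyname(slave_fd)      # expected to look like /dev/pts/N
    rdev = os.fstat(slave_fd).st_rdev      # device number of the slave node
    return master_fd, slave_fd, slave_name, os.major(rdev), os.minor(rdev)


if __name__ == "__main__":
    master_fd, slave_fd, name, major, minor = open_pty_pair()
    try:
        print(f"slave {name}: major {major}, minor {minor}")
    finally:
        os.close(slave_fd)
        os.close(master_fd)
```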
133 block SCSI disk devices (208-223) 0 = /dev/sdha 209th SCSI disk whole disk 16 = /dev/sdhb 210th SCSI disk whole disk 32 = /dev/sdhc 211th SCSI disk whole disk ... 240 = /dev/sdhp 224th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 134 block SCSI disk devices (224-239) 0 = /dev/sdhq 225th SCSI disk whole disk 16 = /dev/sdhr 226th SCSI disk whole disk 32 = /dev/sdhs 227th SCSI disk whole disk ... 240 = /dev/sdif 240th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 135 block SCSI disk devices (240-255) 0 = /dev/sdig 241st SCSI disk whole disk 16 = /dev/sdih 242nd SCSI disk whole disk 32 = /dev/sdii 243rd SCSI disk whole disk ... 240 = /dev/sdiv 256th SCSI disk whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 136-143 char Unix98 PTY slaves 0 = /dev/pts/0 First Unix98 pseudo-TTY 1 = /dev/pts/1 Second Unix98 pseudo-TTY ... These device nodes are automatically generated with the proper permissions and modes by mounting the devpts filesystem onto /dev/pts with the appropriate mount options (distribution dependent, however, on *most* distributions the appropriate options are "mode=0620,gid=".) 136 block Mylex DAC960 PCI RAID controller; ninth controller 0 = /dev/rd/c8d0 First disk, whole disk 8 = /dev/rd/c8d1 Second disk, whole disk ... 248 = /dev/rd/c8d31 32nd disk, whole disk Partitions are handled as for major 48. 137 block Mylex DAC960 PCI RAID controller; tenth controller 0 = /dev/rd/c9d0 First disk, whole disk 8 = /dev/rd/c9d1 Second disk, whole disk ... 248 = /dev/rd/c9d31 32nd disk, whole disk Partitions are handled as for major 48. 138 block Mylex DAC960 PCI RAID controller; eleventh controller 0 = /dev/rd/c10d0 First disk, whole disk 8 = /dev/rd/c10d1 Second disk, whole disk ...
248 = /dev/rd/c10d31 32nd disk, whole disk Partitions are handled as for major 48. 139 block Mylex DAC960 PCI RAID controller; twelfth controller 0 = /dev/rd/c11d0 First disk, whole disk 8 = /dev/rd/c11d1 Second disk, whole disk ... 248 = /dev/rd/c11d31 32nd disk, whole disk Partitions are handled as for major 48. 140 block Mylex DAC960 PCI RAID controller; thirteenth controller 0 = /dev/rd/c12d0 First disk, whole disk 8 = /dev/rd/c12d1 Second disk, whole disk ... 248 = /dev/rd/c12d31 32nd disk, whole disk Partitions are handled as for major 48. 141 block Mylex DAC960 PCI RAID controller; fourteenth controller 0 = /dev/rd/c13d0 First disk, whole disk 8 = /dev/rd/c13d1 Second disk, whole disk ... 248 = /dev/rd/c13d31 32nd disk, whole disk Partitions are handled as for major 48. 142 block Mylex DAC960 PCI RAID controller; fifteenth controller 0 = /dev/rd/c14d0 First disk, whole disk 8 = /dev/rd/c14d1 Second disk, whole disk ... 248 = /dev/rd/c14d31 32nd disk, whole disk Partitions are handled as for major 48. 143 block Mylex DAC960 PCI RAID controller; sixteenth controller 0 = /dev/rd/c15d0 First disk, whole disk 8 = /dev/rd/c15d1 Second disk, whole disk ... 248 = /dev/rd/c15d31 32nd disk, whole disk Partitions are handled as for major 48. 144 char Encapsulated PPP 0 = /dev/pppox0 First PPP over Ethernet ... 63 = /dev/pppox63 64th PPP over Ethernet This is primarily used for ADSL. The SST 5136-DN DeviceNet interface driver has been relocated to major 183 due to an unfortunate conflict. 144 block Expansion Area #1 for more non-device (e.g. NFS) mounts 0 = mounted device 256 255 = mounted device 511 145 char SAM9407-based soundcard 0 = /dev/sam0_mixer 1 = /dev/sam0_sequencer 2 = /dev/sam0_midi00 3 = /dev/sam0_dsp 4 = /dev/sam0_audio 6 = /dev/sam0_sndstat 18 = /dev/sam0_midi01 34 = /dev/sam0_midi02 50 = /dev/sam0_midi03 64 = /dev/sam1_mixer ... 128 = /dev/sam2_mixer ... 192 = /dev/sam3_mixer ... 
                Device functions match OSS, but offer a number of
                add-ons which are SAM9407-specific.  OSS can be
                operated simultaneously, taking care of the codec.

145 block       Expansion Area #2 for more non-device (e.g. NFS) mounts
                  0 = mounted device 512
                255 = mounted device 767

146 char        SYSTRAM SCRAMNet mirrored-memory network
                  0 = /dev/scramnet0    First SCRAMNet device
                  1 = /dev/scramnet1    Second SCRAMNet device
                    ...

146 block       Expansion Area #3 for more non-device (e.g. NFS) mounts
                  0 = mounted device 768
                255 = mounted device 1023

147 char        Aureal Semiconductor Vortex Audio device
                  0 = /dev/aureal0      First Aureal Vortex
                  1 = /dev/aureal1      Second Aureal Vortex
                    ...

147 block       Distributed Replicated Block Device (DRBD)
                  0 = /dev/drbd0        First DRBD device
                  1 = /dev/drbd1        Second DRBD device
                    ...

148 char        Technology Concepts serial card
                  0 = /dev/ttyT0        First TCL port
                  1 = /dev/ttyT1        Second TCL port
                    ...

149 char        Technology Concepts serial card - alternate devices
                  0 = /dev/cut0         Callout device for ttyT0
                  1 = /dev/cut1         Callout device for ttyT1
                    ...

150 char        Real-Time Linux FIFOs
                  0 = /dev/rtf0         First RTLinux FIFO
                  1 = /dev/rtf1         Second RTLinux FIFO
                    ...

151 char        DPT I2O SmartRaid V controller
                  0 = /dev/dpti0        First DPT I2O adapter
                  1 = /dev/dpti1        Second DPT I2O adapter
                    ...

152 char        EtherDrive Control Device
                  0 = /dev/etherd/ctl   Connect/Disconnect an EtherDrive
                  1 = /dev/etherd/err   Monitor errors
                  2 = /dev/etherd/raw   Raw AoE packet monitor

152 block       EtherDrive Block Devices
                  0 = /dev/etherd/0     EtherDrive 0
                    ...
                255 = /dev/etherd/255   EtherDrive 255

153 char        SPI Bus Interface (sometimes referred to as MicroWire)
                  0 = /dev/spi0         First SPI device on the bus
                  1 = /dev/spi1         Second SPI device on the bus
                    ...
                 15 = /dev/spi15        Sixteenth SPI device on the bus

153 block       Enhanced Metadisk RAID (EMD) storage units
                  0 = /dev/emd/0        First unit
                  1 = /dev/emd/0p1      Partition 1 on First unit
                  2 = /dev/emd/0p2      Partition 2 on First unit
                    ...
                 15 = /dev/emd/0p15     Partition 15 on First unit
                 16 = /dev/emd/1        Second unit
                 32 = /dev/emd/2        Third unit
                    ...
240 = /dev/emd/15 Sixteenth unit Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 154 char Specialix RIO serial card 0 = /dev/ttySR0 First RIO port ... 255 = /dev/ttySR255 256th RIO port 155 char Specialix RIO serial card - alternate devices 0 = /dev/cusr0 Callout device for ttySR0 ... 255 = /dev/cusr255 Callout device for ttySR255 156 char Specialix RIO serial card 0 = /dev/ttySR256 257th RIO port ... 255 = /dev/ttySR511 512th RIO port 157 char Specialix RIO serial card - alternate devices 0 = /dev/cusr256 Callout device for ttySR256 ... 255 = /dev/cusr511 Callout device for ttySR511 158 char Dialogic GammaLink fax driver 0 = /dev/gfax0 GammaLink channel 0 1 = /dev/gfax1 GammaLink channel 1 ... 159 char RESERVED 159 block RESERVED 160 char General Purpose Instrument Bus (GPIB) 0 = /dev/gpib0 First GPIB bus 1 = /dev/gpib1 Second GPIB bus ... 160 block Carmel 8-port SATA Disks on First Controller 0 = /dev/carmel/0 SATA disk 0 whole disk 1 = /dev/carmel/0p1 SATA disk 0 partition 1 ... 31 = /dev/carmel/0p31 SATA disk 0 partition 31 32 = /dev/carmel/1 SATA disk 1 whole disk 64 = /dev/carmel/2 SATA disk 2 whole disk ... 224 = /dev/carmel/7 SATA disk 7 whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 31. 161 char IrCOMM devices (IrDA serial/parallel emulation) 0 = /dev/ircomm0 First IrCOMM device 1 = /dev/ircomm1 Second IrCOMM device ... 16 = /dev/irlpt0 First IrLPT device 17 = /dev/irlpt1 Second IrLPT device ... 161 block Carmel 8-port SATA Disks on Second Controller 0 = /dev/carmel/8 SATA disk 8 whole disk 1 = /dev/carmel/8p1 SATA disk 8 partition 1 ... 31 = /dev/carmel/8p31 SATA disk 8 partition 31 32 = /dev/carmel/9 SATA disk 9 whole disk 64 = /dev/carmel/10 SATA disk 10 whole disk ... 
224 = /dev/carmel/15 SATA disk 15 whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 31. 162 char Raw block device interface 0 = /dev/rawctl Raw I/O control device 1 = /dev/raw/raw1 First raw I/O device 2 = /dev/raw/raw2 Second raw I/O device ... 163 char 164 char Chase Research AT/PCI-Fast serial card 0 = /dev/ttyCH0 AT/PCI-Fast board 0, port 0 ... 15 = /dev/ttyCH15 AT/PCI-Fast board 0, port 15 16 = /dev/ttyCH16 AT/PCI-Fast board 1, port 0 ... 31 = /dev/ttyCH31 AT/PCI-Fast board 1, port 15 32 = /dev/ttyCH32 AT/PCI-Fast board 2, port 0 ... 47 = /dev/ttyCH47 AT/PCI-Fast board 2, port 15 48 = /dev/ttyCH48 AT/PCI-Fast board 3, port 0 ... 63 = /dev/ttyCH63 AT/PCI-Fast board 3, port 15 165 char Chase Research AT/PCI-Fast serial card - alternate devices 0 = /dev/cuch0 Callout device for ttyCH0 ... 63 = /dev/cuch63 Callout device for ttyCH63 166 char ACM USB modems 0 = /dev/ttyACM0 First ACM modem 1 = /dev/ttyACM1 Second ACM modem ... 167 char ACM USB modems - alternate devices 0 = /dev/cuacm0 Callout device for ttyACM0 1 = /dev/cuacm1 Callout device for ttyACM1 ... 168 char Eracom CSA7000 PCI encryption adaptor 0 = /dev/ecsa0 First CSA7000 1 = /dev/ecsa1 Second CSA7000 ... 169 char Eracom CSA8000 PCI encryption adaptor 0 = /dev/ecsa8-0 First CSA8000 1 = /dev/ecsa8-1 Second CSA8000 ... 170 char AMI MegaRAC remote access controller 0 = /dev/megarac0 First MegaRAC card 1 = /dev/megarac1 Second MegaRAC card ... 171 char Reserved for IEEE 1394 (Firewire) 172 char Moxa Intellio serial card 0 = /dev/ttyMX0 First Moxa port 1 = /dev/ttyMX1 Second Moxa port ... 127 = /dev/ttyMX127 128th Moxa port 128 = /dev/moxactl Moxa control port 173 char Moxa Intellio serial card - alternate devices 0 = /dev/cumx0 Callout device for ttyMX0 1 = /dev/cumx1 Callout device for ttyMX1 ... 
                127 = /dev/cumx127      Callout device for ttyMX127

174 char        SmartIO serial card
                  0 = /dev/ttySI0       First SmartIO port
                  1 = /dev/ttySI1       Second SmartIO port
                    ...

175 char        SmartIO serial card - alternate devices
                  0 = /dev/cusi0        Callout device for ttySI0
                  1 = /dev/cusi1        Callout device for ttySI1
                    ...

176 char        nCipher nFast PCI crypto accelerator
                  0 = /dev/nfastpci0    First nFast PCI device
                  1 = /dev/nfastpci1    Second nFast PCI device
                    ...

177 char        TI PCILynx memory spaces
                  0 = /dev/pcilynx/aux0   AUX space of first PCILynx card
                    ...
                 15 = /dev/pcilynx/aux15  AUX space of 16th PCILynx card
                 16 = /dev/pcilynx/rom0   ROM space of first PCILynx card
                    ...
                 31 = /dev/pcilynx/rom15  ROM space of 16th PCILynx card
                 32 = /dev/pcilynx/ram0   RAM space of first PCILynx card
                    ...
                 47 = /dev/pcilynx/ram15  RAM space of 16th PCILynx card

178 char        Giganet cLAN1xxx virtual interface adapter
                  0 = /dev/clanvi0      First cLAN adapter
                  1 = /dev/clanvi1      Second cLAN adapter
                    ...

179 block       MMC block devices
                  0 = /dev/mmcblk0      First SD/MMC card
                  1 = /dev/mmcblk0p1    First partition on first MMC card
                  8 = /dev/mmcblk1      Second SD/MMC card
                    ...

                The minor number at which the next SD/MMC card starts
                can be configured with CONFIG_MMC_BLOCK_MINORS, or
                overridden at boot/modprobe time using the
                mmcblk.perdev_minors option.  This sets the offset
                between consecutive cards to the configured value
                instead of the default 8.

179 char        CCube DVXChip-based PCI products
                  0 = /dev/dvxirq0      First DVX device
                  1 = /dev/dvxirq1      Second DVX device
                    ...

180 char        USB devices
                  0 = /dev/usb/lp0      First USB printer
                    ...
                 15 = /dev/usb/lp15     16th USB printer
                 48 = /dev/usb/scanner0 First USB scanner
                    ...
                 63 = /dev/usb/scanner15 16th USB scanner
                 64 = /dev/usb/rio500   Diamond Rio 500
                 65 = /dev/usb/usblcd   USBLCD Interface (info@usblcd.de)
                 66 = /dev/usb/cpad0    Synaptics cPad (mouse/LCD)
                 96 = /dev/usb/hiddev0  1st USB HID device
                    ...
                111 = /dev/usb/hiddev15 16th USB HID device
                112 = /dev/usb/auer0    1st auerswald ISDN device
                    ...
                127 = /dev/usb/auer15   16th auerswald ISDN device
                128 = /dev/usb/brlvgr0  First Braille Voyager device
                    ...
                131 = /dev/usb/brlvgr3  Fourth Braille Voyager device
                132 = /dev/usb/idmouse  ID Mouse (fingerprint scanner) device
                133 = /dev/usb/sisusbvga1       First SiSUSB VGA device
                    ...
                140 = /dev/usb/sisusbvga8       Eighth SiSUSB VGA device
                144 = /dev/usb/lcd      USB LCD device
                160 = /dev/usb/legousbtower0    1st USB Legotower device
                    ...
                175 = /dev/usb/legousbtower15   16th USB Legotower device
                176 = /dev/usb/usbtmc1  First USB TMC device
                    ...
                191 = /dev/usb/usbtmc16 16th USB TMC device
                192 = /dev/usb/yurex1   First USB Yurex device
                    ...
                209 = /dev/usb/yurex16  16th USB Yurex device
                240 = /dev/usb/dabusb0  First dabusb device
                    ...
                243 = /dev/usb/dabusb3  Fourth dabusb device

180 block       USB block devices
                  0 = /dev/uba          First USB block device
                  8 = /dev/ubb          Second USB block device
                 16 = /dev/ubc          Third USB block device
                    ...

181 char        Conrad Electronic parallel port radio clocks
                  0 = /dev/pcfclock0    First Conrad radio clock
                  1 = /dev/pcfclock1    Second Conrad radio clock
                    ...

182 char        Picture Elements THR2 binarizer
                  0 = /dev/pethr0       First THR2 board
                  1 = /dev/pethr1       Second THR2 board
                    ...

183 char        SST 5136-DN DeviceNet interface
                  0 = /dev/ss5136dn0    First DeviceNet interface
                  1 = /dev/ss5136dn1    Second DeviceNet interface
                    ...

                This device used to be assigned to major number 144.
                It had to be moved due to an unfortunate conflict.

184 char        Picture Elements' video simulator/sender
                  0 = /dev/pevss0       First sender board
                  1 = /dev/pevss1       Second sender board
                    ...

185 char        InterMezzo high availability file system
                  0 = /dev/intermezzo0  First cache manager
                  1 = /dev/intermezzo1  Second cache manager
                    ...

                See http://web.archive.org/web/20080115195241/http://inter-mezzo.org/index.html

186 char        Object-based storage control device
                  0 = /dev/obd0         First obd control device
                  1 = /dev/obd1         Second obd control device
                    ...

                See ftp://ftp.lustre.org/pub/obd for code and information.
187 char DESkey hardware encryption device 0 = /dev/deskey0 First DES key 1 = /dev/deskey1 Second DES key ... 188 char USB serial converters 0 = /dev/ttyUSB0 First USB serial converter 1 = /dev/ttyUSB1 Second USB serial converter ... 189 char USB serial converters - alternate devices 0 = /dev/cuusb0 Callout device for ttyUSB0 1 = /dev/cuusb1 Callout device for ttyUSB1 ... 190 char Kansas City tracker/tuner card 0 = /dev/kctt0 First KCT/T card 1 = /dev/kctt1 Second KCT/T card ... 191 char Reserved for PCMCIA 192 char Kernel profiling interface 0 = /dev/profile Profiling control device 1 = /dev/profile0 Profiling device for CPU 0 2 = /dev/profile1 Profiling device for CPU 1 ... 193 char Kernel event-tracing interface 0 = /dev/trace Tracing control device 1 = /dev/trace0 Tracing device for CPU 0 2 = /dev/trace1 Tracing device for CPU 1 ... 194 char linVideoStreams (LINVS) 0 = /dev/mvideo/status0 Video compression status 1 = /dev/mvideo/stream0 Video stream 2 = /dev/mvideo/frame0 Single compressed frame 3 = /dev/mvideo/rawframe0 Raw uncompressed frame 4 = /dev/mvideo/codec0 Direct codec access 5 = /dev/mvideo/video4linux0 Video4Linux compatibility 16 = /dev/mvideo/status1 Second device ... 32 = /dev/mvideo/status2 Third device ... ... 240 = /dev/mvideo/status15 16th device ... 195 char Nvidia graphics devices 0 = /dev/nvidia0 First Nvidia card 1 = /dev/nvidia1 Second Nvidia card ... 255 = /dev/nvidiactl Nvidia card control device 196 char Tormenta T1 card 0 = /dev/tor/0 Master control channel for all cards 1 = /dev/tor/1 First DS0 2 = /dev/tor/2 Second DS0 ... 48 = /dev/tor/48 48th DS0 49 = /dev/tor/49 First pseudo-channel 50 = /dev/tor/50 Second pseudo-channel ... 197 char OpenTNF tracing facility 0 = /dev/tnf/t0 Trace 0 data extraction 1 = /dev/tnf/t1 Trace 1 data extraction ... 
128 = /dev/tnf/status Tracing facility status 130 = /dev/tnf/trace Tracing device 198 char Total Impact TPMP2 quad coprocessor PCI card 0 = /dev/tpmp2/0 First card 1 = /dev/tpmp2/1 Second card ... 199 char Veritas volume manager (VxVM) volumes 0 = /dev/vx/rdsk/*/* First volume 1 = /dev/vx/rdsk/*/* Second volume ... 199 block Veritas volume manager (VxVM) volumes 0 = /dev/vx/dsk/*/* First volume 1 = /dev/vx/dsk/*/* Second volume ... The namespace in these directories is maintained by the user space VxVM software. 200 char Veritas VxVM configuration interface 0 = /dev/vx/config Configuration access node 1 = /dev/vx/trace Volume i/o trace access node 2 = /dev/vx/iod Volume i/o daemon access node 3 = /dev/vx/info Volume information access node 4 = /dev/vx/task Volume tasks access node 5 = /dev/vx/taskmon Volume tasks monitor daemon 201 char Veritas VxVM dynamic multipathing driver 0 = /dev/vx/rdmp/* First multipath device 1 = /dev/vx/rdmp/* Second multipath device ... 201 block Veritas VxVM dynamic multipathing driver 0 = /dev/vx/dmp/* First multipath device 1 = /dev/vx/dmp/* Second multipath device ... The namespace in these directories is maintained by the user space VxVM software. 202 char CPU model-specific registers 0 = /dev/cpu/0/msr MSRs on CPU 0 1 = /dev/cpu/1/msr MSRs on CPU 1 ... 202 block Xen Virtual Block Device 0 = /dev/xvda First Xen VBD whole disk 16 = /dev/xvdb Second Xen VBD whole disk 32 = /dev/xvdc Third Xen VBD whole disk ... 240 = /dev/xvdp Sixteenth Xen VBD whole disk Partitions are handled in the same way as for IDE disks (see major number 3) except that the limit on partitions is 15. 203 char CPU CPUID information 0 = /dev/cpu/0/cpuid CPUID on CPU 0 1 = /dev/cpu/1/cpuid CPUID on CPU 1 ... 
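
The Xen virtual block device layout under block major 202 above packs the disk and the partition into one minor number. The following is a minimal sketch of that arithmetic, not kernel code: the helper names are hypothetical, and the partition suffix (/dev/xvdb1) is assumed to follow the IDE convention the table refers to.

```c
#include <assert.h>
#include <stdio.h>

/* Layout assumed from the block major 202 table: 16 minors per disk,
 * where offset 0 is the whole disk and offsets 1..15 are partitions. */
enum { XVD_MINORS_PER_DISK = 16, XVD_DISKS = 16 };

struct xvd_node {
    char disk_letter;   /* 'a' for the first disk, 'p' for the sixteenth */
    unsigned partition; /* 0 means the whole disk */
};

/* Hypothetical helper: split a minor number into disk and partition. */
struct xvd_node xvd_decode(unsigned minor_nr)
{
    struct xvd_node node;

    assert(minor_nr < (unsigned)(XVD_MINORS_PER_DISK * XVD_DISKS));
    node.disk_letter = (char)('a' + minor_nr / XVD_MINORS_PER_DISK);
    node.partition = minor_nr % XVD_MINORS_PER_DISK;
    return node;
}

/* Hypothetical helper: format the node name, e.g. "/dev/xvdb1".
 * Returns the length snprintf() reports. */
int xvd_format(struct xvd_node node, char *buf, size_t len)
{
    if (node.partition == 0)
        return snprintf(buf, len, "/dev/xvd%c", node.disk_letter);
    return snprintf(buf, len, "/dev/xvd%c%u", node.disk_letter,
                    node.partition);
}
```

The same divide-and-remainder split applies to the other tables that reserve 16 minors per disk, with only the name prefix changing.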
204 char Low-density serial ports 0 = /dev/ttyLU0 LinkUp Systems L72xx UART - port 0 1 = /dev/ttyLU1 LinkUp Systems L72xx UART - port 1 2 = /dev/ttyLU2 LinkUp Systems L72xx UART - port 2 3 = /dev/ttyLU3 LinkUp Systems L72xx UART - port 3 4 = /dev/ttyFB0 Intel Footbridge (ARM) 5 = /dev/ttySA0 StrongARM builtin serial port 0 6 = /dev/ttySA1 StrongARM builtin serial port 1 7 = /dev/ttySA2 StrongARM builtin serial port 2 8 = /dev/ttySC0 SCI serial port (SuperH) - port 0 9 = /dev/ttySC1 SCI serial port (SuperH) - port 1 10 = /dev/ttySC2 SCI serial port (SuperH) - port 2 11 = /dev/ttySC3 SCI serial port (SuperH) - port 3 12 = /dev/ttyFW0 Firmware console - port 0 13 = /dev/ttyFW1 Firmware console - port 1 14 = /dev/ttyFW2 Firmware console - port 2 15 = /dev/ttyFW3 Firmware console - port 3 16 = /dev/ttyAM0 ARM "AMBA" serial port 0 ... 31 = /dev/ttyAM15 ARM "AMBA" serial port 15 32 = /dev/ttyDB0 DataBooster serial port 0 ... 39 = /dev/ttyDB7 DataBooster serial port 7 40 = /dev/ttySG0 SGI Altix console port 41 = /dev/ttySMX0 Motorola i.MX - port 0 42 = /dev/ttySMX1 Motorola i.MX - port 1 43 = /dev/ttySMX2 Motorola i.MX - port 2 44 = /dev/ttyMM0 Marvell MPSC - port 0 45 = /dev/ttyMM1 Marvell MPSC - port 1 46 = /dev/ttyCPM0 PPC CPM (SCC or SMC) - port 0 ... 47 = /dev/ttyCPM5 PPC CPM (SCC or SMC) - port 5 50 = /dev/ttyIOC0 Altix serial card ... 81 = /dev/ttyIOC31 Altix serial card 82 = /dev/ttyVR0 NEC VR4100 series SIU 83 = /dev/ttyVR1 NEC VR4100 series DSIU 84 = /dev/ttyIOC84 Altix ioc4 serial card ... 115 = /dev/ttyIOC115 Altix ioc4 serial card 116 = /dev/ttySIOC0 Altix ioc3 serial card ... 147 = /dev/ttySIOC31 Altix ioc3 serial card 148 = /dev/ttyPSC0 PPC PSC - port 0 ... 153 = /dev/ttyPSC5 PPC PSC - port 5 154 = /dev/ttyAT0 ATMEL serial port 0 ... 169 = /dev/ttyAT15 ATMEL serial port 15 170 = /dev/ttyNX0 Hilscher netX serial port 0 ... 
185 = /dev/ttyNX15 Hilscher netX serial port 15 186 = /dev/ttyJ0 JTAG1 DCC protocol based serial port emulation 187 = /dev/ttyUL0 Xilinx uartlite - port 0 ... 190 = /dev/ttyUL3 Xilinx uartlite - port 3 191 = /dev/xvc0 Xen virtual console - port 0 192 = /dev/ttyPZ0 pmac_zilog - port 0 ... 195 = /dev/ttyPZ3 pmac_zilog - port 3 196 = /dev/ttyTX0 TX39/49 serial port 0 ... 204 = /dev/ttyTX7 TX39/49 serial port 7 205 = /dev/ttySC0 SC26xx serial port 0 206 = /dev/ttySC1 SC26xx serial port 1 207 = /dev/ttySC2 SC26xx serial port 2 208 = /dev/ttySC3 SC26xx serial port 3 209 = /dev/ttyMAX0 MAX3100 serial port 0 210 = /dev/ttyMAX1 MAX3100 serial port 1 211 = /dev/ttyMAX2 MAX3100 serial port 2 212 = /dev/ttyMAX3 MAX3100 serial port 3 205 char Low-density serial ports (alternate device) 0 = /dev/culu0 Callout device for ttyLU0 1 = /dev/culu1 Callout device for ttyLU1 2 = /dev/culu2 Callout device for ttyLU2 3 = /dev/culu3 Callout device for ttyLU3 4 = /dev/cufb0 Callout device for ttyFB0 5 = /dev/cusa0 Callout device for ttySA0 6 = /dev/cusa1 Callout device for ttySA1 7 = /dev/cusa2 Callout device for ttySA2 8 = /dev/cusc0 Callout device for ttySC0 9 = /dev/cusc1 Callout device for ttySC1 10 = /dev/cusc2 Callout device for ttySC2 11 = /dev/cusc3 Callout device for ttySC3 12 = /dev/cufw0 Callout device for ttyFW0 13 = /dev/cufw1 Callout device for ttyFW1 14 = /dev/cufw2 Callout device for ttyFW2 15 = /dev/cufw3 Callout device for ttyFW3 16 = /dev/cuam0 Callout device for ttyAM0 ... 31 = /dev/cuam15 Callout device for ttyAM15 32 = /dev/cudb0 Callout device for ttyDB0 ... 39 = /dev/cudb7 Callout device for ttyDB7 40 = /dev/cusg0 Callout device for ttySG0 41 = /dev/ttycusmx0 Callout device for ttySMX0 42 = /dev/ttycusmx1 Callout device for ttySMX1 43 = /dev/ttycusmx2 Callout device for ttySMX2 46 = /dev/cucpm0 Callout device for ttyCPM0 ... 49 = /dev/cucpm5 Callout device for ttyCPM5 50 = /dev/cuioc40 Callout device for ttyIOC40 ... 
81 = /dev/cuioc431 Callout device for ttyIOC431 82 = /dev/cuvr0 Callout device for ttyVR0 83 = /dev/cuvr1 Callout device for ttyVR1 206 char OnStream SC-x0 tape devices 0 = /dev/osst0 First OnStream SCSI tape, mode 0 1 = /dev/osst1 Second OnStream SCSI tape, mode 0 ... 32 = /dev/osst0l First OnStream SCSI tape, mode 1 33 = /dev/osst1l Second OnStream SCSI tape, mode 1 ... 64 = /dev/osst0m First OnStream SCSI tape, mode 2 65 = /dev/osst1m Second OnStream SCSI tape, mode 2 ... 96 = /dev/osst0a First OnStream SCSI tape, mode 3 97 = /dev/osst1a Second OnStream SCSI tape, mode 3 ... 128 = /dev/nosst0 No rewind version of /dev/osst0 129 = /dev/nosst1 No rewind version of /dev/osst1 ... 160 = /dev/nosst0l No rewind version of /dev/osst0l 161 = /dev/nosst1l No rewind version of /dev/osst1l ... 192 = /dev/nosst0m No rewind version of /dev/osst0m 193 = /dev/nosst1m No rewind version of /dev/osst1m ... 224 = /dev/nosst0a No rewind version of /dev/osst0a 225 = /dev/nosst1a No rewind version of /dev/osst1a ... The OnStream SC-x0 SCSI tapes do not support the standard SCSI SASD command set and therefore need their own driver "osst". Note that the IDE, USB (and maybe ParPort) versions may be driven via ide-scsi or usb-storage SCSI emulation and this osst device and driver as well. The ADR-x0 drives are QIC-157 compliant and don't need osst. 
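
The OnStream minor numbers listed above step by 32 per mode and by 128 for the no-rewind variants, which reads naturally as a bit field. The sketch below decodes a minor on that assumption; the struct and function names are hypothetical, and the limit of 32 drives per mode is inferred from the spacing in the table rather than stated by it.

```c
#include <assert.h>
#include <stdio.h>

struct osst_minor {
    unsigned unit;  /* drive index, assumed 0..31 */
    unsigned mode;  /* 0..3, shown as suffix "", "l", "m", "a" */
    int no_rewind;  /* nonzero for the /dev/nosst* nodes */
};

/* Hypothetical helper: bits 0-4 = unit, bits 5-6 = mode, bit 7 = no rewind. */
struct osst_minor osst_decode(unsigned minor_nr)
{
    struct osst_minor m;

    assert(minor_nr < 256u);
    m.unit = minor_nr & 0x1f;
    m.mode = (minor_nr >> 5) & 0x3;
    m.no_rewind = (int)((minor_nr >> 7) & 0x1);
    return m;
}

/* Hypothetical helper: rebuild the node name used in the table. */
int osst_format(struct osst_minor m, char *buf, size_t len)
{
    static const char *const suffix[4] = { "", "l", "m", "a" };

    return snprintf(buf, len, "/dev/%sosst%u%s",
                    m.no_rewind ? "n" : "", m.unit, suffix[m.mode]);
}
```

For example, minor 193 decodes to unit 1, mode 2, no rewind, which matches the /dev/nosst1m row.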
207 char Compaq ProLiant health feature indicate 0 = /dev/cpqhealth/cpqw Redirector interface 1 = /dev/cpqhealth/crom EISA CROM 2 = /dev/cpqhealth/cdt Data Table 3 = /dev/cpqhealth/cevt Event Log 4 = /dev/cpqhealth/casr Automatic Server Recovery 5 = /dev/cpqhealth/cecc ECC Memory 6 = /dev/cpqhealth/cmca Machine Check Architecture 7 = /dev/cpqhealth/ccsm Deprecated CDT 8 = /dev/cpqhealth/cnmi NMI Handling 9 = /dev/cpqhealth/css Sideshow Management 10 = /dev/cpqhealth/cram CMOS interface 11 = /dev/cpqhealth/cpci PCI IRQ interface 208 char User space serial ports 0 = /dev/ttyU0 First user space serial port 1 = /dev/ttyU1 Second user space serial port ... 209 char User space serial ports (alternate devices) 0 = /dev/cuu0 Callout device for ttyU0 1 = /dev/cuu1 Callout device for ttyU1 ... 210 char SBE, Inc. sync/async serial card 0 = /dev/sbei/wxcfg0 Configuration device for board 0 1 = /dev/sbei/dld0 Download device for board 0 2 = /dev/sbei/wan00 WAN device, port 0, board 0 3 = /dev/sbei/wan01 WAN device, port 1, board 0 4 = /dev/sbei/wan02 WAN device, port 2, board 0 5 = /dev/sbei/wan03 WAN device, port 3, board 0 6 = /dev/sbei/wanc00 WAN clone device, port 0, board 0 7 = /dev/sbei/wanc01 WAN clone device, port 1, board 0 8 = /dev/sbei/wanc02 WAN clone device, port 2, board 0 9 = /dev/sbei/wanc03 WAN clone device, port 3, board 0 10 = /dev/sbei/wxcfg1 Configuration device for board 1 11 = /dev/sbei/dld1 Download device for board 1 12 = /dev/sbei/wan10 WAN device, port 0, board 1 13 = /dev/sbei/wan11 WAN device, port 1, board 1 14 = /dev/sbei/wan12 WAN device, port 2, board 1 15 = /dev/sbei/wan13 WAN device, port 3, board 1 16 = /dev/sbei/wanc10 WAN clone device, port 0, board 1 17 = /dev/sbei/wanc11 WAN clone device, port 1, board 1 18 = /dev/sbei/wanc12 WAN clone device, port 2, board 1 19 = /dev/sbei/wanc13 WAN clone device, port 3, board 1 ... Yes, each board is really spaced 10 (decimal) apart. 
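
Because the SBE boards are spaced 10 minors apart rather than on a power of two, decoding needs decimal division instead of bit masks. A minimal sketch follows, assuming the per-board layout shown in the table (slot 0 configuration, slot 1 download, slots 2-5 WAN ports, slots 6-9 WAN clone ports); the type and function names are hypothetical.

```c
#include <assert.h>

enum sbei_kind { SBEI_WXCFG, SBEI_DLD, SBEI_WAN, SBEI_WANC };

struct sbei_minor {
    unsigned board;
    enum sbei_kind kind;
    unsigned port;  /* 0..3 for WAN and WAN clone devices, 0 otherwise */
};

/* Hypothetical helper: split a minor into board, device kind and port. */
struct sbei_minor sbei_decode(unsigned minor_nr)
{
    struct sbei_minor m;
    unsigned slot = minor_nr % 10;  /* position within the board's block */

    assert(minor_nr < 256u);
    m.board = minor_nr / 10;        /* boards are 10 (decimal) apart */
    m.port = 0;
    if (slot == 0) {
        m.kind = SBEI_WXCFG;
    } else if (slot == 1) {
        m.kind = SBEI_DLD;
    } else if (slot < 6) {
        m.kind = SBEI_WAN;
        m.port = slot - 2;
    } else {
        m.kind = SBEI_WANC;
        m.port = slot - 6;
    }
    return m;
}
```

Minor 13, for instance, decodes to board 1, WAN device, port 1, which corresponds to /dev/sbei/wan11.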
211 char        Addinum CPCI1500 digital I/O card
                  0 = /dev/addinum/cpci1500/0   First CPCI1500 card
                  1 = /dev/addinum/cpci1500/1   Second CPCI1500 card
                    ...

212 char        LinuxTV.org DVB driver subsystem
                  0 = /dev/dvb/adapter0/video0    first video decoder of first card
                  1 = /dev/dvb/adapter0/audio0    first audio decoder of first card
                  2 = /dev/dvb/adapter0/sec0      (obsolete/unused)
                  3 = /dev/dvb/adapter0/frontend0 first frontend device of first card
                  4 = /dev/dvb/adapter0/demux0    first demux device of first card
                  5 = /dev/dvb/adapter0/dvr0      first digital video recorder device of first card
                  6 = /dev/dvb/adapter0/ca0       first common access port of first card
                  7 = /dev/dvb/adapter0/net0      first network device of first card
                  8 = /dev/dvb/adapter0/osd0      first on-screen-display device of first card
                  9 = /dev/dvb/adapter0/video1    second video decoder of first card
                    ...
                 64 = /dev/dvb/adapter1/video0    first video decoder of second card
                    ...
                128 = /dev/dvb/adapter2/video0    first video decoder of third card
                    ...
                192 = /dev/dvb/adapter3/video0    first video decoder of fourth card

216 char        Bluetooth RFCOMM TTY devices
                  0 = /dev/rfcomm0      First Bluetooth RFCOMM TTY device
                  1 = /dev/rfcomm1      Second Bluetooth RFCOMM TTY device
                    ...

217 char        Bluetooth RFCOMM TTY devices (alternate devices)
                  0 = /dev/curf0        Callout device for rfcomm0
                  1 = /dev/curf1        Callout device for rfcomm1
                    ...

218 char        The Logical Company bus Unibus/Qbus adapters
                  0 = /dev/logicalco/bci/0      First bus adapter
                  1 = /dev/logicalco/bci/1      Second bus adapter
                    ...

219 char        The Logical Company DCI-1300 digital I/O card
                  0 = /dev/logicalco/dci1300/0  First DCI-1300 card
                  1 = /dev/logicalco/dci1300/1  Second DCI-1300 card
                    ...

220 char        Myricom Myrinet "GM" board
                  0 = /dev/myricom/gm0  First Myrinet GM board
                  1 = /dev/myricom/gmp0 First board "root access"
                  2 = /dev/myricom/gm1  Second Myrinet GM board
                  3 = /dev/myricom/gmp1 Second board "root access"
                    ...
221 char        VME bus
                  0 = /dev/bus/vme/m0   First master image
                  1 = /dev/bus/vme/m1   Second master image
                  2 = /dev/bus/vme/m2   Third master image
                  3 = /dev/bus/vme/m3   Fourth master image
                  4 = /dev/bus/vme/s0   First slave image
                  5 = /dev/bus/vme/s1   Second slave image
                  6 = /dev/bus/vme/s2   Third slave image
                  7 = /dev/bus/vme/s3   Fourth slave image
                  8 = /dev/bus/vme/ctl  Control

                It is expected that all VME bus drivers will use the
                same interface.  For interface documentation see
                http://www.vmelinux.org/.

224 char        A2232 serial card
                  0 = /dev/ttyY0        First A2232 port
                  1 = /dev/ttyY1        Second A2232 port
                    ...

225 char        A2232 serial card (alternate devices)
                  0 = /dev/cuy0         Callout device for ttyY0
                  1 = /dev/cuy1         Callout device for ttyY1
                    ...

226 char        Direct Rendering Infrastructure (DRI)
                  0 = /dev/dri/card0    First graphics card
                  1 = /dev/dri/card1    Second graphics card
                    ...

227 char        IBM 3270 terminal Unix tty access
                  1 = /dev/3270/tty1    First 3270 terminal
                  2 = /dev/3270/tty2    Second 3270 terminal
                    ...

228 char        IBM 3270 terminal block-mode access
                  0 = /dev/3270/tub     Controlling interface
                  1 = /dev/3270/tub1    First 3270 terminal
                  2 = /dev/3270/tub2    Second 3270 terminal
                    ...

229 char        IBM iSeries/pSeries virtual console
                  0 = /dev/hvc0         First console port
                  1 = /dev/hvc1         Second console port
                    ...

230 char        IBM iSeries virtual tape
                  0 = /dev/iseries/vt0    First virtual tape, mode 0
                  1 = /dev/iseries/vt1    Second virtual tape, mode 0
                    ...
                 32 = /dev/iseries/vt0l   First virtual tape, mode 1
                 33 = /dev/iseries/vt1l   Second virtual tape, mode 1
                    ...
                 64 = /dev/iseries/vt0m   First virtual tape, mode 2
                 65 = /dev/iseries/vt1m   Second virtual tape, mode 2
                    ...
                 96 = /dev/iseries/vt0a   First virtual tape, mode 3
                 97 = /dev/iseries/vt1a   Second virtual tape, mode 3
                    ...
                128 = /dev/iseries/nvt0   First virtual tape, mode 0, no rewind
                129 = /dev/iseries/nvt1   Second virtual tape, mode 0, no rewind
                    ...
                160 = /dev/iseries/nvt0l  First virtual tape, mode 1, no rewind
                161 = /dev/iseries/nvt1l  Second virtual tape, mode 1, no rewind
                    ...
                192 = /dev/iseries/nvt0m  First virtual tape, mode 2, no rewind
                193 = /dev/iseries/nvt1m  Second virtual tape, mode 2, no rewind
                    ...
                224 = /dev/iseries/nvt0a  First virtual tape, mode 3, no rewind
                225 = /dev/iseries/nvt1a  Second virtual tape, mode 3, no rewind
                    ...

                "No rewind" refers to the omission of the default
                automatic rewind on device close.  The MTREW or MTOFFL
                ioctls can be used to rewind the tape regardless of
                the device used to access it.

231 char        InfiniBand
                  0 = /dev/infiniband/umad0
                  1 = /dev/infiniband/umad1
                    ...
                 63 = /dev/infiniband/umad63    64th InfiniBand MAD device
                 64 = /dev/infiniband/issm0     First InfiniBand IsSM device
                 65 = /dev/infiniband/issm1     Second InfiniBand IsSM device
                    ...
                127 = /dev/infiniband/issm63    64th InfiniBand IsSM device
                128 = /dev/infiniband/uverbs0   First InfiniBand verbs device
                129 = /dev/infiniband/uverbs1   Second InfiniBand verbs device
                    ...
                159 = /dev/infiniband/uverbs31  32nd InfiniBand verbs device

232 char        Biometric Devices
                  0 = /dev/biometric/sensor0/fingerprint  first fingerprint sensor on first device
                  1 = /dev/biometric/sensor0/iris         first iris sensor on first device
                  2 = /dev/biometric/sensor0/retina       first retina sensor on first device
                  3 = /dev/biometric/sensor0/voiceprint   first voiceprint sensor on first device
                  4 = /dev/biometric/sensor0/facial       first facial sensor on first device
                  5 = /dev/biometric/sensor0/hand         first hand sensor on first device
                    ...
                 10 = /dev/biometric/sensor1/fingerprint  first fingerprint sensor on second device
                    ...
                 20 = /dev/biometric/sensor2/fingerprint  first fingerprint sensor on third device
                    ...

233 char        PathScale InfiniPath interconnect
                  0 = /dev/ipath        Primary device for programs (any unit)
                  1 = /dev/ipath0       Access specifically to unit 0
                  2 = /dev/ipath1       Access specifically to unit 1
                    ...
4 = /dev/ipath3 Access specifically to unit 3 129 = /dev/ipath_sma Device used by Subnet Management Agent 130 = /dev/ipath_diag Device used by diagnostics programs 234-239 UNASSIGNED 240-254 char LOCAL/EXPERIMENTAL USE 240-254 block LOCAL/EXPERIMENTAL USE Allocated for local/experimental use. For devices not assigned official numbers, these ranges should be used in order to avoid conflicting with future assignments. 255 char RESERVED 255 block RESERVED This major is reserved to assist the expansion to a larger number space. No device nodes with this major should ever be created on the filesystem. (This is probably not true anymore, but I'll leave it for now /Torben) ---LARGE MAJORS!!!!!--- 256 char Equinox SST multi-port serial boards 0 = /dev/ttyEQ0 First serial port on first Equinox SST board 127 = /dev/ttyEQ127 Last serial port on first Equinox SST board 128 = /dev/ttyEQ128 First serial port on second Equinox SST board ... 1027 = /dev/ttyEQ1027 Last serial port on eighth Equinox SST board 256 block Resident Flash Disk Flash Translation Layer 0 = /dev/rfda First RFD FTL layer 16 = /dev/rfdb Second RFD FTL layer ... 240 = /dev/rfdp 16th RFD FTL layer 257 char Phoenix Technologies Cryptographic Services Driver 0 = /dev/ptlsec Crypto Services Driver 257 block SSFDC Flash Translation Layer filesystem 0 = /dev/ssfdca First SSFDC layer 8 = /dev/ssfdcb Second SSFDC layer 16 = /dev/ssfdcc Third SSFDC layer 24 = /dev/ssfdcd 4th SSFDC layer 32 = /dev/ssfdce 5th SSFDC layer 40 = /dev/ssfdcf 6th SSFDC layer 48 = /dev/ssfdcg 7th SSFDC layer 56 = /dev/ssfdch 8th SSFDC layer 258 block ROM/Flash read-only translation layer 0 = /dev/blockrom0 First ROM card's translation layer interface 1 = /dev/blockrom1 Second ROM card's translation layer interface ... 
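
Majors of 256 and above no longer fit the old 8-bit major field, so user-space code should build and split device numbers with the library macros rather than with hand-written shifts. The sketch below assumes Linux with glibc, where makedev(), major() and minor() come from <sys/sysmacros.h>; the range helpers are hypothetical and simply restate the ranges quoted in the table above.

```c
#include <assert.h>
#include <sys/sysmacros.h>  /* makedev(), major(), minor() on glibc */
#include <sys/types.h>      /* dev_t */

/* Hypothetical helper: majors 240-254 are for local/experimental use. */
int is_local_experimental(unsigned maj)
{
    return maj >= 240 && maj <= 254;
}

/* Hypothetical helper: majors listed under "LARGE MAJORS". */
int is_large_major(unsigned maj)
{
    return maj >= 256;
}

/* Build a device number and check that it survives a round trip,
 * which would fail if dev_t could not hold the major or minor. */
dev_t checked_makedev(unsigned maj, unsigned min)
{
    dev_t dev = makedev(maj, min);

    assert(major(dev) == maj && minor(dev) == min);
    return dev;
}
```

For instance, checked_makedev(260, 255) corresponds to the /dev/osd255 row of char major 260.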
259 block       Block Extended Major
                  Used dynamically to hold additional partition minor
                  numbers and allow large numbers of partitions per device

259 char        FPGA configuration interfaces
                  0 = /dev/icap0        First Xilinx internal configuration
                  1 = /dev/icap1        Second Xilinx internal configuration

260 char        OSD (Object-based-device) SCSI Device
                  0 = /dev/osd0         First OSD Device
                  1 = /dev/osd1         Second OSD Device
                    ...
                255 = /dev/osd255       256th OSD Device

 ****   ADDITIONAL /dev DIRECTORY ENTRIES

This section details additional entries that should or may exist in
the /dev directory.  It is preferred that symbolic links use the same
form (absolute or relative) as is indicated here.  Links are
classified as "hard" or "symbolic" depending on the preferred type of
link; if possible, the indicated type of link should be used.


        Compulsory links

These links should exist on all systems:

/dev/fd         /proc/self/fd   symbolic        File descriptors
/dev/stdin      fd/0            symbolic        stdin file descriptor
/dev/stdout     fd/1            symbolic        stdout file descriptor
/dev/stderr     fd/2            symbolic        stderr file descriptor
/dev/nfsd       socksys         symbolic        Required by iBCS-2
/dev/X0R        null            symbolic        Required by iBCS-2

Note: /dev/X0R is letter X, digit 0, letter R.

        Recommended links

It is recommended that these links exist on all systems:

/dev/core       /proc/kcore     symbolic        Backward compatibility
/dev/ramdisk    ram0            symbolic        Backward compatibility
/dev/ftape      qft0            symbolic        Backward compatibility
/dev/bttv0      video0          symbolic        Backward compatibility
/dev/radio      radio0          symbolic        Backward compatibility
/dev/i2o*       /dev/i2o/*      symbolic        Backward compatibility
/dev/scd?       sr?             hard            Alternate SCSI CD-ROM name

        Locally defined links

The following links may be established locally to conform to the
configuration of the system.  This is merely a tabulation of existing
practice, and does not constitute a recommendation.  However, if they
exist, they should have the following uses.

/dev/mouse      mouse port      symbolic        Current mouse device
/dev/tape       tape device     symbolic        Current tape device
/dev/cdrom      CD-ROM device   symbolic        Current CD-ROM device
/dev/cdwriter   CD-writer       symbolic        Current CD-writer device
/dev/scanner    scanner         symbolic        Current scanner device
/dev/modem      modem port      symbolic        Current dialout device
/dev/root       root device     symbolic        Current root filesystem
/dev/swap       swap device     symbolic        Current swap device

/dev/modem should not be used for a modem which supports dialin as
well as dialout, as it tends to cause lock file problems.  If it
exists, /dev/modem should point to the appropriate primary TTY device
(the use of the alternate callout devices is deprecated).

For SCSI devices, /dev/tape and /dev/cdrom should point to the
``cooked'' devices (/dev/st* and /dev/sr*, respectively), whereas
/dev/cdwriter and /dev/scanner should point to the appropriate generic
SCSI devices (/dev/sg*).

/dev/mouse may point to a primary serial TTY device, a hardware mouse
device, or a socket for a mouse driver program (e.g. /dev/gpmdata).

        Sockets and pipes

Non-transient sockets and named pipes may exist in /dev.  Common entries are:

/dev/printer    socket          lpd local socket
/dev/log        socket          syslog local socket
/dev/gpmdata    socket          gpm mouse multiplexer

        Mount points

The following names are reserved for mounting special filesystems
under /dev.  These special filesystems provide kernel interfaces that
cannot be provided with standard device nodes.

/dev/pts        devpts          PTY slave filesystem
/dev/shm        tmpfs           POSIX shared memory maintenance access

 ****   TERMINAL DEVICES

Terminal, or TTY devices are a special class of character devices.  A
terminal device is any device that could act as a controlling terminal
for a session; this includes virtual consoles, serial ports, and
pseudoterminals (PTYs).

All terminal devices share a common set of capabilities known as line
disciplines; these include the common terminal line discipline as well
as SLIP and PPP modes.

All terminal devices are named similarly; this section explains the
naming and use of the various types of TTYs.  Note that the naming
conventions include several historical warts; some of these are
Linux-specific, some were inherited from other systems, and some
reflect Linux outgrowing a borrowed convention.

A hash mark (#) in a device name is used here to indicate a decimal
number without leading zeroes.

        Virtual consoles and the console device

Virtual consoles are full-screen terminal displays on the system video
monitor.  Virtual consoles are named /dev/tty#, with numbering
starting at /dev/tty1; /dev/tty0 is the current virtual console.
/dev/tty0 is the device that should be used to access the system video
card on those architectures for which the frame buffer devices
(/dev/fb*) are not applicable.  Do not use /dev/console for this
purpose.

The console device, /dev/console, is the device to which system
messages should be sent, and on which logins should be permitted in
single-user mode.  Starting with Linux 2.1.71, /dev/console is managed
by the kernel; for previous versions it should be a symbolic link to
either /dev/tty0, a specific virtual console such as /dev/tty1, or to
a serial port primary (tty*, not cu*) device, depending on the
configuration of the system.

        Serial ports

Serial ports are RS-232 serial ports and any device which simulates
one, either in hardware (such as internal modems) or in software (such
as the ISDN driver).

Under Linux, each serial port has two device names, the primary or
callin device and the alternate or callout one.  Each kind of device
is indicated by a different letter.  For any letter X, the names of
the devices are /dev/ttyX# and /dev/cux#, respectively; for historical
reasons, /dev/ttyS# and /dev/ttyC# correspond to /dev/cua# and
/dev/cub#.  In the future, it should be expected that multiple letters
will be used; all letters will be upper case for the "tty" device
(e.g. /dev/ttyDP#) and lower case for the "cu" device (e.g.
/dev/cudp#).

The names /dev/ttyQ# and /dev/cuq# are reserved for local use.

The alternate devices provide for kernel-based exclusion and somewhat different defaults than the primary devices. Their main purpose is to allow the use of serial ports with programs with no inherent or broken support for serial ports. Their use is deprecated, and they may be removed from a future version of Linux.

Arbitration of serial ports is provided by the use of lock files with the names /var/lock/LCK..ttyX#. The contents of the lock file should be the PID of the locking process as an ASCII number.

It is common practice to install links such as /dev/modem which point to serial ports. In order to ensure proper locking in the presence of these links, it is recommended that software chase symlinks and lock all possible names; additionally, it is recommended that a lock file be installed with the corresponding alternate device. In order to avoid deadlocks, it is recommended that the locks are acquired in the following order, and released in the reverse:

    1. The symbolic link name, if any (/var/lock/LCK..modem)
    2. The "tty" name (/var/lock/LCK..ttyS2)
    3. The alternate device name (/var/lock/LCK..cua2)

In the case of nested symbolic links, the lock files should be installed in the order the symlinks are resolved.

Under no circumstances should an application hold a lock while waiting for another to be released. In addition, applications which attempt to create lock files for the corresponding alternate device names should take into account the possibility of being used on a non-serial port TTY, for which no alternate device would exist.

Pseudoterminals (PTYs)

Pseudoterminals, or PTYs, are used to create login sessions or provide other capabilities requiring a TTY line discipline (including SLIP or PPP capability) to arbitrary data-generation processes. Each PTY has a master side, named /dev/pty[p-za-e][0-9a-f], and a slave side, named /dev/tty[p-za-e][0-9a-f].
The kernel arbitrates the use of PTYs by allowing each master side to be opened only once.

Once the master side has been opened, the corresponding slave device can be used in the same manner as any TTY device. The master and slave devices are connected by the kernel, generating the equivalent of a bidirectional pipe with TTY capabilities.

Recent versions of the Linux kernel and GNU libc contain support for the System V/Unix98 naming scheme for PTYs, which assigns a common device, /dev/ptmx, to all the masters (opening it will automatically give you a previously unassigned PTY) and a subdirectory, /dev/pts, for the slaves; the slaves are named with decimal integers (/dev/pts/# in our notation). This removes the problem of exhausting the namespace and enables the kernel to automatically create the device nodes for the slaves on demand using the "devpts" filesystem.


Digital Signature Verification API

CONTENTS

1. Introduction
2. API
3. User-space utilities


1. Introduction

The digital signature verification API provides a method to verify a digital signature. Currently digital signatures are used by the IMA/EVM integrity protection subsystem.

Digital signature verification is implemented using a cut-down kernel port of the GnuPG multi-precision integers (MPI) library. The kernel port handles memory allocation errors, has been refactored according to the kernel coding style, and the errors and warnings reported by checkpatch.pl have been fixed.

A public key and a signature each consist of a header and MPIs.

	struct pubkey_hdr {
		uint8_t		version;	/* key format version */
		time_t		timestamp;	/* key made, always 0 for now */
		uint8_t		algo;
		uint8_t		nmpi;
		char		mpi[0];
	} __packed;

	struct signature_hdr {
		uint8_t		version;	/* signature format version */
		time_t		timestamp;	/* signature made */
		uint8_t		algo;
		uint8_t		hash;
		uint8_t		keyid[8];
		uint8_t		nmpi;
		char		mpi[0];
	} __packed;

keyid equals SHA1[12-19] over the total key content. The signature header is used as an input to generate a signature.
Such an approach ensures that the key or signature header cannot be changed. It protects the timestamp from being changed and can be used for rollback protection.

2. API

The API currently includes only one function:

	digsig_verify() - digital signature verification with public key

	/**
	 * digsig_verify() - digital signature verification with public key
	 * @keyring:	keyring to search key in
	 * @sig:	digital signature
	 * @siglen:	length of the signature
	 * @data:	data
	 * @datalen:	length of the data
	 * @return:	0 on success, -EINVAL otherwise
	 *
	 * Verifies data integrity against digital signature.
	 * Currently only RSA is supported.
	 * Normally hash of the content is used as a data for this function.
	 *
	 */
	int digsig_verify(struct key *keyring, const char *sig, int siglen,
			  const char *data, int datalen);

3. User-space utilities

The signing and key management utilities evm-utils provide functionality to generate signatures and to load keys into the kernel keyring. Keys can be in PEM or converted to the kernel format. When the key is added to the kernel keyring, the keyid defines the name of the key: 5D2B05FC633EE3E8 in the example below.

Here is example output of the keyctl utility.

	$ keyctl show
	Session Keyring
	       -3 --alswrv      0     0  keyring: _ses
	603976250 --alswrv      0    -1   \_ keyring: _uid.0
	817777377 --alswrv      0     0       \_ user: kmk
	891974900 --alswrv      0     0       \_ encrypted: evm-key
	170323636 --alswrv      0     0       \_ keyring: _module
	548221616 --alswrv      0     0       \_ keyring: _ima
	128198054 --alswrv      0     0       \_ keyring: _evm

	$ keyctl list 128198054
	1 key in keyring:
	620789745: --alswrv     0     0 user: 5D2B05FC633EE3E8


Dmitry Kasatkin
06.10.2011


Dynamic DMA mapping using the generic device
============================================

James E.J. Bottomley

This document describes the DMA API. For a more gentle introduction of the API (and actual examples) see Documentation/DMA-API-HOWTO.txt.

This API is split into two pieces. Part I describes the API. Part II describes the extensions to the API for supporting non-consistent memory machines.
Unless you know that your driver absolutely has to support non-consistent platforms (this is usually only legacy platforms) you should only use the API described in part I.

Part I - dma_ API
-------------------------------------

To get the dma_ API, you must #include <linux/dma-mapping.h>


Part Ia - Using large dma-coherent buffers
------------------------------------------

void *
dma_alloc_coherent(struct device *dev, size_t size,
		   dma_addr_t *dma_handle, gfp_t flag)

Consistent memory is memory for which a write by either the device or the processor can immediately be read by the processor or device without having to worry about caching effects. (You may however need to make sure to flush the processor's write buffers before telling devices to read that memory.)

This routine allocates a region of <size> bytes of consistent memory. It also returns a <dma_handle> which may be cast to an unsigned integer the same width as the bus and used as the physical address base of the region.

Returns: a pointer to the allocated region (in the processor's virtual address space) or NULL if the allocation failed.

Note: consistent memory can be expensive on some platforms, and the minimum allocation length may be as big as a page, so you should consolidate your requests for consistent memory as much as possible. The simplest way to do that is to use the dma_pool calls (see below).

The flag parameter (dma_alloc_coherent only) allows the caller to specify the GFP_ flags (see kmalloc) for the allocation (the implementation may choose to ignore flags that affect the location of the returned memory, like GFP_DMA).

void *
dma_zalloc_coherent(struct device *dev, size_t size,
		    dma_addr_t *dma_handle, gfp_t flag)

Wraps dma_alloc_coherent() and also zeroes the returned memory if the allocation attempt succeeded.

void
dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
		  dma_addr_t dma_handle)

Free the region of consistent memory you previously allocated.
dev, size and dma_handle must all be the same as those passed into the consistent allocate. cpu_addr must be the virtual address returned by the consistent allocate.

Note that unlike their sibling allocation calls, these routines may only be called with IRQs enabled.


Part Ib - Using small dma-coherent buffers
------------------------------------------

To get this part of the dma_ API, you must #include <linux/dmapool.h>

Many drivers need lots of small dma-coherent memory regions for DMA descriptors or I/O buffers. Rather than allocating in units of a page or more using dma_alloc_coherent(), you can use DMA pools. These work much like a struct kmem_cache, except that they use the dma-coherent allocator, not __get_free_pages(). Also, they understand common hardware constraints for alignment, like queue heads needing to be aligned on N-byte boundaries.

	struct dma_pool *
	dma_pool_create(const char *name, struct device *dev,
			size_t size, size_t align, size_t alloc);

The pool create() routines initialize a pool of dma-coherent buffers for use with a given device. It must be called in a context which can sleep.

The "name" is for diagnostics (like a struct kmem_cache name); dev and size are like what you'd pass to dma_alloc_coherent(). The device's hardware alignment requirement for this type of data is "align" (which is expressed in bytes, and must be a power of two). If your device has no boundary crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated from this pool must not cross 4KByte boundaries.

	void *dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags,
			     dma_addr_t *dma_handle);

This allocates memory from the pool; the returned memory will meet the size and alignment requirements specified at creation time. Pass GFP_ATOMIC to prevent blocking, or if it's permitted (not in_interrupt, not holding SMP locks), pass GFP_KERNEL to allow blocking.
Like dma_alloc_coherent(), this returns two values: an address usable by the cpu, and the dma address usable by the pool's device. void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t addr); This puts memory back into the pool. The pool is what was passed to the pool allocation routine; the cpu (vaddr) and dma addresses are what were returned when that routine allocated the memory being freed. void dma_pool_destroy(struct dma_pool *pool); The pool destroy() routines free the resources of the pool. They must be called in a context which can sleep. Make sure you've freed all allocated memory back to the pool before you destroy it. Part Ic - DMA addressing limitations ------------------------------------ int dma_supported(struct device *dev, u64 mask) Checks to see if the device can support DMA to the memory described by mask. Returns: 1 if it can and 0 if it can't. Notes: This routine merely tests to see if the mask is possible. It won't change the current mask settings. It is more intended as an internal API for use by the platform than an external API for use by driver writers. int dma_set_mask(struct device *dev, u64 mask) Checks to see if the mask is possible and updates the device parameters if it is. Returns: 0 if successful and a negative error if not. int dma_set_coherent_mask(struct device *dev, u64 mask) Checks to see if the mask is possible and updates the device parameters if it is. Returns: 0 if successful and a negative error if not. u64 dma_get_required_mask(struct device *dev) This API returns the mask that the platform requires to operate efficiently. Usually this means the returned mask is the minimum required to cover all of memory. Examining the required mask gives drivers with variable descriptor sizes the opportunity to use smaller descriptors as necessary. Requesting the required mask does not alter the current mask. If you wish to take advantage of it, you should issue a dma_set_mask() call to set the mask to the value returned. 
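The relationship between the required mask and the choice of descriptor size can be illustrated outside the kernel. What follows is a minimal user-space sketch, not kernel code: bit_mask() uses the same arithmetic as the kernel's DMA_BIT_MASK() macro, while required_mask() and use_small_descriptors() are hypothetical helpers that assume the platform only needs to cover its highest physical address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* n-bit address mask; same arithmetic as the kernel's DMA_BIT_MASK(n). */
uint64_t bit_mask(unsigned int n)
{
	assert(n >= 1 && n <= 64);
	return n == 64 ? ~UINT64_C(0) : (UINT64_C(1) << n) - 1;
}

/* Hypothetical model: smallest all-ones mask that covers highest_addr. */
uint64_t required_mask(uint64_t highest_addr)
{
	uint64_t mask = 0;

	while (mask < highest_addr)
		mask = (mask << 1) | 1;
	return mask;
}

/*
 * A device offering both 32-bit and 64-bit descriptors may use the
 * smaller ones when the required mask does not exceed 32 bits.
 */
bool use_small_descriptors(uint64_t highest_addr)
{
	return required_mask(highest_addr) <= bit_mask(32);
}
```

In a real driver the value would come from dma_get_required_mask() and the chosen mask would then be passed to dma_set_mask(); the helpers here only model the decision.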
Part Id - Streaming DMA mappings
--------------------------------

dma_addr_t
dma_map_single(struct device *dev, void *cpu_addr, size_t size,
	       enum dma_data_direction direction)

Maps a piece of processor virtual memory so it can be accessed by the device and returns the physical handle of the memory.

The direction for both APIs may be converted freely by casting. However the dma_ API uses a strongly typed enumerator for its direction:

DMA_NONE		no direction (used for debugging)
DMA_TO_DEVICE		data is going from the memory to the device
DMA_FROM_DEVICE		data is coming from the device to the memory
DMA_BIDIRECTIONAL	direction isn't known

Notes: Not all memory regions in a machine can be mapped by this API. Further, regions that appear to be physically contiguous in kernel virtual space may not be contiguous as physical memory. Since this API does not provide any scatter/gather capability, it will fail if the user tries to map a non-physically contiguous piece of memory. For this reason, it is recommended that memory mapped by this API be obtained only from sources which guarantee it to be physically contiguous (like kmalloc).

Further, the physical address of the memory must be within the dma_mask of the device (the dma_mask represents a bit mask of the addressable region for the device; i.e., if the physical address of the memory ANDed with the dma_mask is still equal to the physical address, then the device can perform DMA to the memory). In order to ensure that the memory allocated by kmalloc is within the dma_mask, the driver may specify various platform-dependent flags to restrict the physical memory range of the allocation (e.g. on x86, GFP_DMA guarantees to be within the first 16MB of available physical memory, as required by ISA devices).

Note also that the above constraints on physical contiguity and dma_mask may not apply if the platform has an IOMMU (a device which supplies a physical to virtual mapping between the I/O memory bus and the device).
However, to be portable, device driver writers may *not* assume that such an IOMMU exists. Warnings: Memory coherency operates at a granularity called the cache line width. In order for memory mapped by this API to operate correctly, the mapped region must begin exactly on a cache line boundary and end exactly on one (to prevent two separately mapped regions from sharing a single cache line). Since the cache line size may not be known at compile time, the API will not enforce this requirement. Therefore, it is recommended that driver writers who don't take special care to determine the cache line size at run time only map virtual regions that begin and end on page boundaries (which are guaranteed also to be cache line boundaries). DMA_TO_DEVICE synchronisation must be done after the last modification of the memory region by the software and before it is handed off to the driver. Once this primitive is used, memory covered by this primitive should be treated as read-only by the device. If the device may write to it at any point, it should be DMA_BIDIRECTIONAL (see below). DMA_FROM_DEVICE synchronisation must be done before the driver accesses data that may be changed by the device. This memory should be treated as read-only by the driver. If the driver needs to write to it at any point, it should be DMA_BIDIRECTIONAL (see below). DMA_BIDIRECTIONAL requires special handling: it means that the driver isn't sure if the memory was modified before being handed off to the device and also isn't sure if the device will also modify it. Thus, you must always sync bidirectional memory twice: once before the memory is handed off to the device (to make sure all memory changes are flushed from the processor) and once before the data may be accessed after being used by the device (to make sure any processor cache lines are updated with data that the device may have changed). 
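The cache line warning above reduces to simple arithmetic once the line size is known. The following is a user-space sketch under that assumption: the helper names are hypothetical and not part of the DMA API, and in a driver the line size would have to be determined at run time (for instance with dma_get_cache_alignment(), described in Part II).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if line is a non-zero power of two. */
bool is_pow2(size_t line)
{
	return line != 0 && (line & (line - 1)) == 0;
}

/* True if [addr, addr + size) begins and ends on a cache line boundary. */
bool region_is_line_aligned(uintptr_t addr, size_t size, size_t line)
{
	assert(is_pow2(line));
	return (addr & (line - 1)) == 0 && (size & (line - 1)) == 0;
}

/* Round size up to the next multiple of the cache line size. */
size_t round_up_to_line(size_t size, size_t line)
{
	assert(is_pow2(line));
	return (size + line - 1) & ~(line - 1);
}
```

For example, with a 64-byte line a 66-byte buffer would have to be padded to round_up_to_line(66, 64), that is 128 bytes, before it could safely be mapped on its own.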
void
dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size,
		 enum dma_data_direction direction)

Unmaps the region previously mapped. All the parameters passed in must be identical to those passed in (and returned) by the mapping API.

dma_addr_t
dma_map_page(struct device *dev, struct page *page,
	     unsigned long offset, size_t size,
	     enum dma_data_direction direction)
void
dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
	       enum dma_data_direction direction)

API for mapping and unmapping for pages. All the notes and warnings for the other mapping APIs apply here. Also, although the <offset> and <size> parameters are provided to do partial page mapping, it is recommended that you never use these unless you really know what the cache width is.

int
dma_mapping_error(struct device *dev, dma_addr_t dma_addr)

In some circumstances dma_map_single and dma_map_page will fail to create a mapping. A driver can check for these errors by testing the returned dma address with dma_mapping_error(). A non-zero return value means the mapping could not be created and the driver should take appropriate action (e.g. reduce current DMA mapping usage or delay and try again later).

int
dma_map_sg(struct device *dev, struct scatterlist *sg,
	   int nents, enum dma_data_direction direction)

Returns: the number of physical segments mapped (this may be shorter than passed in if some elements of the scatter/gather list are physically or virtually adjacent and an IOMMU maps them with a single entry).

Please note that the sg cannot be mapped again if it has been mapped once. The mapping process is allowed to destroy information in the sg.

As with the other mapping interfaces, dma_map_sg can fail. When it does, 0 is returned and a driver must take appropriate action. It is critical that the driver do something; in the case of a block driver, aborting the request or even oopsing is better than doing nothing and corrupting the filesystem.
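To see why dma_map_sg() may return fewer segments than it was given, it can help to model the merging step in user space. The sketch below is only an illustration: struct seg and merge_contiguous() are hypothetical and are not how any particular platform implements the mapping; they merely merge neighbouring entries whose address ranges touch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatter/gather entry. */
struct seg {
	uint64_t addr;	/* start address of the segment */
	uint32_t len;	/* length in bytes */
};

/*
 * Copy n input segments to out[], merging neighbours that are contiguous.
 * Returns the number of entries written to out[], which is <= n.
 * out[] must have room for n entries.
 */
size_t merge_contiguous(const struct seg *in, size_t n, struct seg *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (count > 0 &&
		    out[count - 1].addr + out[count - 1].len == in[i].addr)
			out[count - 1].len += in[i].len;	/* extend previous */
		else
			out[count++] = in[i];			/* start a new one */
	}
	return count;
}
```

A caller would then program the hardware with the returned count of entries rather than with n, which mirrors the rule that the driver loops over the count returned by dma_map_sg() and not over nents.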
With scatterlists, you use the resulting mapping like this:

	int i, count = dma_map_sg(dev, sglist, nents, direction);
	struct scatterlist *sg;

	for_each_sg(sglist, sg, count, i) {
		hw_address[i] = sg_dma_address(sg);
		hw_len[i] = sg_dma_len(sg);
	}

where nents is the number of entries in the sglist.

The implementation is free to merge several consecutive sglist entries into one (e.g. with an IOMMU, or if several pages just happen to be physically contiguous) and returns the actual number of sg entries it mapped them to. On failure, 0 is returned.

Then you should loop count times (note: this can be less than nents times) and use sg_dma_address() and sg_dma_len() macros where you previously accessed sg->address and sg->length as shown above.

void
dma_unmap_sg(struct device *dev, struct scatterlist *sg,
	     int nhwentries, enum dma_data_direction direction)

Unmap the previously mapped scatter/gather list. All the parameters must be the same as those passed in to the scatter/gather mapping API.

Note: <nents> must be the number you passed in, *not* the number of physical entries returned.

void
dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, size_t size,
			enum dma_data_direction direction)
void
dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, size_t size,
			   enum dma_data_direction direction)
void
dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nelems,
		    enum dma_data_direction direction)
void
dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int nelems,
		       enum dma_data_direction direction)

Synchronise a single contiguous or scatter/gather mapping for the cpu and device. With the sync_sg API, all the parameters must be the same as those passed into the scatter/gather mapping API. With the sync_single API, you can use dma_handle and size parameters that aren't identical to those passed into the single mapping API to do a partial sync.
Notes: You must do this:

- Before reading values that have been written by DMA from the device (use the DMA_FROM_DEVICE direction)
- After writing values that will be written to the device using DMA (use the DMA_TO_DEVICE direction)
- Before *and* after handing memory to the device if the memory is DMA_BIDIRECTIONAL

See also dma_map_single().

dma_addr_t
dma_map_single_attrs(struct device *dev, void *cpu_addr, size_t size,
		     enum dma_data_direction dir,
		     struct dma_attrs *attrs)

void
dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr,
		       size_t size, enum dma_data_direction dir,
		       struct dma_attrs *attrs)

int
dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
		 int nents, enum dma_data_direction dir,
		 struct dma_attrs *attrs)

void
dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl,
		   int nents, enum dma_data_direction dir,
		   struct dma_attrs *attrs)

The four functions above are just like the counterpart functions without the _attrs suffixes, except that they pass an optional struct dma_attrs*.

struct dma_attrs encapsulates a set of "dma attributes". For the definition of struct dma_attrs see linux/dma-attrs.h.

The interpretation of dma attributes is architecture-specific, and each attribute should be documented in Documentation/DMA-attributes.txt.

If struct dma_attrs* is NULL, the semantics of each of these functions is identical to those of the corresponding function without the _attrs suffix. As a result dma_map_single_attrs() can generally replace dma_map_single(), etc.

As an example of the use of the *_attrs functions, here's how you could pass an attribute DMA_ATTR_FOO when mapping memory for DMA:

	#include <linux/dma-attrs.h>
	/* DMA_ATTR_FOO should be defined in linux/dma-attrs.h and
	 * documented in Documentation/DMA-attributes.txt */
	...

		DEFINE_DMA_ATTRS(attrs);
		dma_set_attr(DMA_ATTR_FOO, &attrs);
		....
		n = dma_map_sg_attrs(dev, sg, nents, DMA_TO_DEVICE, &attrs);
		....
Architectures that care about DMA_ATTR_FOO would check for its presence in their implementations of the mapping and unmapping routines, e.g.:

	void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr,
				     size_t size, enum dma_data_direction dir,
				     struct dma_attrs *attrs)
	{
		....
		int foo = dma_get_attr(DMA_ATTR_FOO, attrs);
		....
		if (foo)
			/* twizzle the frobnozzle */
		....


Part II - Advanced dma_ usage
-----------------------------

Warning: These pieces of the DMA API should not be used in the majority of cases, since they cater for unlikely corner cases that don't belong in usual drivers.

If you don't understand how cache line coherency works between a processor and an I/O device, you should not be using this part of the API at all.

void *
dma_alloc_noncoherent(struct device *dev, size_t size,
		      dma_addr_t *dma_handle, gfp_t flag)

Identical to dma_alloc_coherent() except that the platform will choose to return either consistent or non-consistent memory as it sees fit. By using this API, you are guaranteeing to the platform that you have all the correct and necessary sync points for this memory in the driver should it choose to return non-consistent memory.

Note: where the platform can return consistent memory, it will guarantee that the sync points become nops.

Warning: Handling non-consistent memory is a real pain. You should only ever use this API if you positively know your driver will be required to work on one of the rare (usually non-PCI) architectures that simply cannot make consistent memory.

void
dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr,
		     dma_addr_t dma_handle)

Free memory allocated by the nonconsistent API. All parameters must be identical to those passed in (and returned by dma_alloc_noncoherent()).

int
dma_get_cache_alignment(void)

Returns the processor cache alignment. This is the absolute minimum alignment *and* width that you must observe when either mapping memory or doing partial flushes.
Notes: This API may return a number *larger* than the actual cache line, but it will guarantee that one or more cache lines fit exactly into the width returned by this call. It will also always be a power of two for easy alignment. void dma_cache_sync(struct device *dev, void *vaddr, size_t size, enum dma_data_direction direction) Do a partial sync of memory that was allocated by dma_alloc_noncoherent(), starting at virtual address vaddr and continuing on for size. Again, you *must* observe the cache line boundaries when doing this. int dma_declare_coherent_memory(struct device *dev, dma_addr_t bus_addr, dma_addr_t device_addr, size_t size, int flags) Declare region of memory to be handed out by dma_alloc_coherent when it's asked for coherent memory for this device. bus_addr is the physical address to which the memory is currently assigned in the bus responding region (this will be used by the platform to perform the mapping). device_addr is the physical address the device needs to be programmed with actually to address this memory (this will be handed out as the dma_addr_t in dma_alloc_coherent()). size is the size of the area (must be multiples of PAGE_SIZE). flags can be or'd together and are: DMA_MEMORY_MAP - request that the memory returned from dma_alloc_coherent() be directly writable. DMA_MEMORY_IO - request that the memory returned from dma_alloc_coherent() be addressable using read/write/memcpy_toio etc. One or both of these flags must be present. DMA_MEMORY_INCLUDES_CHILDREN - make the declared memory be allocated by dma_alloc_coherent of any child devices of this one (for memory residing on a bridge). DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions. Do not allow dma_alloc_coherent() to fall back to system memory when it's out of memory in the declared region. The return value will be either DMA_MEMORY_MAP or DMA_MEMORY_IO and must correspond to a passed in flag (i.e. 
no returning DMA_MEMORY_IO if only DMA_MEMORY_MAP were passed in) for success or zero for failure.

Note, for DMA_MEMORY_IO returns, all subsequent memory returned by dma_alloc_coherent() may no longer be accessed directly, but instead must be accessed using the correct bus functions. If your driver isn't prepared to handle this contingency, it should not specify DMA_MEMORY_IO in the input flags.

As a simplification for the platforms, only *one* such region of memory may be declared per device.

For reasons of efficiency, most platforms choose to track the declared region only at the granularity of a page. For smaller allocations, you should use the dma_pool() API.

void
dma_release_declared_memory(struct device *dev)

Remove the memory region previously declared from the system. This API performs *no* in-use checking for this region and will return unconditionally having removed all the required structures. It is the driver's job to ensure that no parts of this memory region are currently in use.

void *
dma_mark_declared_memory_occupied(struct device *dev,
				  dma_addr_t device_addr, size_t size)

This is used to occupy specific regions of the declared space (dma_alloc_coherent() will hand out the first free region it finds).

device_addr is the *device* address of the region requested.

size is the size (and should be a page-sized multiple).

The return value will be either a pointer to the processor virtual address of the memory, or an error (via PTR_ERR()) if any part of the region is occupied.

Part III - Debug drivers use of the DMA-API
-------------------------------------------

The DMA-API as described above has some constraints. For example, DMA addresses must be released with the corresponding function and with the same size. With the advent of hardware IOMMUs it becomes more and more important that drivers do not violate those constraints. In the worst case such a violation can result in data corruption up to destroyed filesystems.
To debug drivers and find bugs in the usage of the DMA-API, checking code can be compiled into the kernel which will tell the developer about those violations. If your architecture supports it you can select the "Enable debugging of DMA-API usage" option in your kernel configuration. Enabling this option has a performance impact. Do not enable it in production kernels.

If you boot the resulting kernel, it will contain code which does some bookkeeping about what DMA memory was allocated for which device. If this code detects an error it prints a warning message with some details into your kernel log. An example warning message may look like this:

------------[ cut here ]------------
WARNING: at /data2/repos/linux-2.6-iommu/lib/dma-debug.c:448
	check_unmap+0x203/0x490()
Hardware name:
forcedeth 0000:00:08.0: DMA-API: device driver frees DMA memory with wrong
	function [device address=0x00000000640444be] [size=66 bytes] [mapped as
single] [unmapped as page]
Modules linked in: nfsd exportfs bridge stp llc r8169
Pid: 0, comm: swapper Tainted: G W 2.6.28-dmatest-09289-g8bb99c0 #1
Call Trace:
 [] warn_slowpath+0xf2/0x130
 [] _spin_unlock+0x10/0x30
 [] usb_hcd_link_urb_to_ep+0x75/0xc0
 [] _spin_unlock_irqrestore+0x12/0x40
 [] ohci_urb_enqueue+0x19f/0x7c0
 [] queue_work+0x56/0x60
 [] enqueue_task_fair+0x20/0x50
 [] usb_hcd_submit_urb+0x379/0xbc0
 [] cpumask_next_and+0x23/0x40
 [] find_busiest_group+0x207/0x8a0
 [] _spin_lock_irqsave+0x1f/0x50
 [] check_unmap+0x203/0x490
 [] debug_dma_unmap_page+0x49/0x50
 [] nv_tx_done_optimized+0xc6/0x2c0
 [] nv_nic_irq_optimized+0x73/0x2b0
 [] handle_IRQ_event+0x34/0x70
 [] handle_edge_irq+0xc9/0x150
 [] do_IRQ+0xcb/0x1c0
 [] ret_from_intr+0x0/0xa
<4>---[ end trace f6435a98e2a38c0e ]---

The driver developer can find the driver and the device including a stacktrace of the DMA-API call which caused this warning.

By default only the first error will result in a warning message. All other errors will only be silently counted.
This limitation exists to prevent the code from flooding your kernel log. To support debugging a device driver this can be disabled via debugfs. See the debugfs interface documentation below for details.

The debugfs directory for the DMA-API debugging code is called dma-api/. In this directory the following files can currently be found:

	dma-api/all_errors
		This file contains a numeric value. If this value is not equal to zero the debugging code will print a warning for every error it finds into the kernel log. Be careful with this option, as it can easily flood your logs.

	dma-api/disabled
		This read-only file contains the character 'Y' if the debugging code is disabled. This can happen when it runs out of memory or if it was disabled at boot time.

	dma-api/error_count
		This file is read-only and shows the total number of errors found.

	dma-api/num_errors
		The number in this file shows how many warnings will be printed to the kernel log before it stops. This number is initialized to one at system boot and can be set by writing into this file.

	dma-api/min_free_entries
		This read-only file can be read to get the minimum number of free dma_debug_entries the allocator has ever seen. If this value goes down to zero the code will disable itself because it is no longer reliable.

	dma-api/num_free_entries
		The current number of free dma_debug_entries in the allocator.

	dma-api/driver-filter
		You can write a name of a driver into this file to limit the debug output to requests from that particular driver. Write an empty string to that file to disable the filter and see all errors again.

If you have this code compiled into your kernel it will be enabled by default. If you want to boot without the bookkeeping anyway you can provide 'dma_debug=off' as a boot parameter. This will disable DMA-API debugging. Notice that you cannot enable it again at runtime. You have to reboot to do so.
If you want to see debug messages only for a special device driver you can specify the dma_debug_driver=<drivername> parameter. This will enable the driver filter at boot time. The debug code will only print errors for that driver afterwards. This filter can be disabled or changed later using debugfs.

When the code disables itself at runtime this is most likely because it ran out of dma_debug_entries. These entries are preallocated at boot. The number of preallocated entries is defined per architecture. If it is too low for you, boot with 'dma_debug_entries=<your_desired_number>' to override the architectural default.


Dynamic DMA mapping Guide
=========================

David S. Miller
Richard Henderson
Jakub Jelinek

This is a guide to device driver writers on how to use the DMA API with example pseudo-code. For a concise description of the API, see DMA-API.txt.

Most of the 64bit platforms have special hardware that translates bus addresses (DMA addresses) into physical addresses. This is similar to how page tables and/or a TLB translates virtual addresses to physical addresses on a CPU. This is needed so that e.g. PCI devices can access with a Single Address Cycle (32bit DMA address) any page in the 64bit physical address space. Previously in Linux those 64bit platforms had to set artificial limits on the maximum RAM size in the system, so that the virt_to_bus() static scheme works (the DMA address translation tables were simply filled on bootup to map each bus address to the physical page __pa(bus_to_virt())).

So that Linux can use the dynamic DMA mapping, it needs some help from the drivers, namely it has to take into account that DMA addresses should be mapped only for the time they are actually used and unmapped after the DMA transfer.

The following API will work of course even on platforms where no such hardware exists.

Note that the DMA API works with any bus independent of the underlying microprocessor architecture. You should use the DMA API rather than the bus specific DMA API (e.g.
pci_dma_*).

First of all, you should make sure

	#include <linux/dma-mapping.h>

is in your driver. This file provides the definition of the dma_addr_t type (which can hold any valid DMA address for the platform). Use this type everywhere you hold a DMA (bus) address returned from the DMA mapping functions.

What memory is DMA'able?

The first piece of information you must know is what kernel memory can be used with the DMA mapping facilities. There has been an unwritten set of rules regarding this, and this text is an attempt to finally write them down.

If you acquired your memory via the page allocator (i.e. __get_free_page*()) or the generic memory allocators (i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from that memory using the addresses returned from those routines.

This means specifically that you may _not_ use the memory/addresses returned from vmalloc() for DMA. It is possible to DMA to the _underlying_ memory mapped into a vmalloc() area, but this requires walking page tables to get the physical addresses, and then translating each of those pages back to a kernel address using something like __va(). [ EDIT: Update this when we integrate Gerd Knorr's generic code which does this. ]

This rule also means that you may use neither kernel image addresses (items in data/text/bss segments), nor module image addresses, nor stack addresses for DMA. These could all be mapped somewhere entirely different than the rest of physical memory. Even if those classes of memory could physically work with DMA, you'd need to ensure the I/O buffers were cacheline-aligned. Without that, you'd see cacheline sharing problems (data corruption) on CPUs with DMA-incoherent caches. (The CPU could write to one word, DMA would write to a different one in the same cache line, and one of them could be overwritten.)

Also, this means that you cannot take the return of a kmap() call and DMA to/from that. This is similar to vmalloc().

What about block I/O and networking buffers?
The block I/O and networking subsystems make sure that the buffers they use are valid for you to DMA from/to.

DMA addressing limitations

Does your device have any DMA addressing limitations? For example, is your device only capable of driving the low order 24-bits of address? If so, you need to inform the kernel of this fact.

By default, the kernel assumes that your device can address the full 32-bits. For a 64-bit capable device, this needs to be increased. And for a device with limitations, as discussed in the previous paragraph, it needs to be decreased.

Special note about PCI: PCI-X specification requires PCI-X devices to support 64-bit addressing (DAC) for all transactions. And at least one platform (SGI SN2) requires 64-bit consistent allocations to operate correctly when the IO bus is in PCI-X mode.

For correct operation, you must interrogate the kernel in your device probe routine to see if the DMA controller on the machine can properly support the DMA addressing limitation your device has. It is good style to do this even if your device holds the default setting, because this shows that you did think about these issues with respect to your device.

The query is performed via a call to dma_set_mask():

	int dma_set_mask(struct device *dev, u64 mask);

The query for consistent allocations is performed via a call to dma_set_coherent_mask():

	int dma_set_coherent_mask(struct device *dev, u64 mask);

Here, dev is a pointer to the device struct of your device, and mask is a bit mask describing which bits of an address your device supports. It returns zero if your card can perform DMA properly on the machine given the address mask you provided. In general, the device struct of your device is embedded in the bus specific device struct of your device. For example, a pointer to the device struct of your PCI device is &pdev->dev (pdev is a pointer to the PCI device struct of your device).
If it returns non-zero, your device cannot perform DMA properly on this platform, and attempting to do so will result in undefined behavior. You must either use a different mask, or not use DMA. This means that in the failure case, you have three options: 1) Use another DMA mask, if possible (see below). 2) Use some non-DMA mode for data transfer, if possible. 3) Ignore this device and do not initialize it. It is recommended that your driver print a kernel KERN_WARNING message when you end up performing either #2 or #3. In this manner, if a user of your driver reports that performance is bad or that the device is not even detected, you can ask them for the kernel messages to find out exactly why. The standard 32-bit addressing device would do something like this: if (dma_set_mask(dev, DMA_BIT_MASK(32))) { printk(KERN_WARNING "mydev: No suitable DMA available.\n"); goto ignore_this_device; } Another common scenario is a 64-bit capable device. The approach here is to try for 64-bit addressing, but back down to a 32-bit mask that should not fail. The kernel may fail the 64-bit mask not because the platform is not capable of 64-bit addressing. Rather, it may fail in this case simply because 32-bit addressing is done more efficiently than 64-bit addressing. For example, Sparc64 PCI SAC addressing is more efficient than DAC addressing. 
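The masks passed to dma_set_mask() in these examples are built with DMA_BIT_MASK(). As a rough illustration of what such a mask means, here is a small user-space sketch. The macro body follows the shape of the kernel's definition; addr_fits_mask() and dma_bit_mask_examples() are hypothetical helpers written for this illustration only, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * The low n bits set. n == 64 is special-cased because shifting a
 * 64-bit value by 64 is undefined in C.
 */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Hypothetical helper (not a kernel function): true if every byte of
 * [addr, addr + len) can be addressed by a device with this mask.
 * Assumes len > 0 and that addr + len - 1 does not wrap around.
 */
static int addr_fits_mask(uint64_t addr, uint64_t len, uint64_t mask)
{
	return (addr + len - 1) <= mask;
}

/* Illustration only. */
static void dma_bit_mask_examples(void)
{
	assert(DMA_BIT_MASK(24) == 0xffffffULL);	/* 16MB limit */
	assert(DMA_BIT_MASK(32) == 0xffffffffULL);
	assert(DMA_BIT_MASK(64) == 0xffffffffffffffffULL);

	/* A page at the 4GB mark is out of reach for a 32-bit mask... */
	assert(!addr_fits_mask(1ULL << 32, 4096, DMA_BIT_MASK(32)));
	/* ...but reachable with a 64-bit mask. */
	assert(addr_fits_mask(1ULL << 32, 4096, DMA_BIT_MASK(64)));
}
```

A driver never performs this check itself; the point is only that a smaller mask describes a smaller reachable address range, which is why the kernel needs to know the mask before it creates mappings.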
Here is how you would handle a 64-bit capable device which can drive all 64-bits when accessing streaming DMA: int using_dac; if (!dma_set_mask(dev, DMA_BIT_MASK(64))) { using_dac = 1; } else if (!dma_set_mask(dev, DMA_BIT_MASK(32))) { using_dac = 0; } else { printk(KERN_WARNING "mydev: No suitable DMA available.\n"); goto ignore_this_device; } If a card is capable of using 64-bit consistent allocations as well, the case would look like this: int using_dac, consistent_using_dac; if (!dma_set_mask(dev, DMA_BIT_MASK(64))) { using_dac = 1; consistent_using_dac = 1; dma_set_coherent_mask(dev, DMA_BIT_MASK(64)); } else if (!dma_set_mask(dev, DMA_BIT_MASK(32))) { using_dac = 0; consistent_using_dac = 0; dma_set_coherent_mask(dev, DMA_BIT_MASK(32)); } else { printk(KERN_WARNING "mydev: No suitable DMA available.\n"); goto ignore_this_device; } dma_set_coherent_mask() will always be able to set the same or a smaller mask as dma_set_mask(). However for the rare case that a device driver only uses consistent allocations, one would have to check the return value from dma_set_coherent_mask(). Finally, if your device can only drive the low 24-bits of address you might do something like: if (dma_set_mask(dev, DMA_BIT_MASK(24))) { printk(KERN_WARNING "mydev: 24-bit DMA addressing not available.\n"); goto ignore_this_device; } When dma_set_mask() is successful, and returns zero, the kernel saves away this mask you have provided. The kernel will use this information later when you make DMA mappings. There is a case which we are aware of at this time, which is worth mentioning in this documentation. If your device supports multiple functions (for example a sound card provides playback and record functions) and the various different functions have _different_ DMA addressing limitations, you may wish to probe each mask and only provide the functionality which the machine can handle. It is important that the last call to dma_set_mask() be for the most specific mask. 
Here is pseudo-code showing how this might be done: #define PLAYBACK_ADDRESS_BITS DMA_BIT_MASK(32) #define RECORD_ADDRESS_BITS DMA_BIT_MASK(24) struct my_sound_card *card; struct device *dev; ... if (!dma_set_mask(dev, PLAYBACK_ADDRESS_BITS)) { card->playback_enabled = 1; } else { card->playback_enabled = 0; printk(KERN_WARNING "%s: Playback disabled due to DMA limitations.\n", card->name); } if (!dma_set_mask(dev, RECORD_ADDRESS_BITS)) { card->record_enabled = 1; } else { card->record_enabled = 0; printk(KERN_WARNING "%s: Record disabled due to DMA limitations.\n", card->name); } A sound card was used as an example here because this genre of PCI devices seems to be littered with ISA chips given a PCI front end, and thus retaining the 16MB DMA addressing limitations of ISA. Types of DMA mappings There are two types of DMA mappings: - Consistent DMA mappings which are usually mapped at driver initialization, unmapped at the end and for which the hardware should guarantee that the device and the CPU can access the data in parallel and will see updates made by each other without any explicit software flushing. Think of "consistent" as "synchronous" or "coherent". The current default is to return consistent memory in the low 32 bits of the bus space. However, for future compatibility you should set the consistent mask even if this default is fine for your driver. Good examples of what to use consistent mappings for are: - Network card DMA ring descriptors. - SCSI adapter mailbox command data structures. - Device firmware microcode executed out of main memory. The invariant these examples all require is that any CPU store to memory is immediately visible to the device, and vice versa. Consistent mappings guarantee this. IMPORTANT: Consistent DMA memory does not preclude the usage of proper memory barriers. The CPU may reorder stores to consistent memory just as it may normal memory. 
Example: if it is important for the device to see the first word of a descriptor updated before the second, you must do something like:

	desc->word0 = address;
	wmb();
	desc->word1 = DESC_VALID;

in order to get correct behavior on all platforms.

Also, on some platforms your driver may need to flush CPU write buffers in much the same way as it needs to flush write buffers found in PCI bridges (such as by reading a register's value after writing it).

- Streaming DMA mappings which are usually mapped for one DMA transfer, unmapped right after it (unless you use dma_sync_* below) and for which hardware can optimize for sequential accesses.

  Think of "streaming" as "asynchronous" or "outside the coherency domain".

  Good examples of what to use streaming mappings for are:

	- Networking buffers transmitted/received by a device.
	- Filesystem buffers written/read by a SCSI device.

  The interfaces for using this type of mapping were designed in such a way that an implementation can make whatever performance optimizations the hardware allows. To this end, when using such mappings you must be explicit about what you want to happen.

Neither type of DMA mapping has alignment restrictions that come from the underlying bus, although some devices may have such restrictions. Also, systems with caches that aren't DMA-coherent will work better when the underlying buffers don't share cache lines with other data.

Using Consistent DMA mappings.

To allocate and map large (PAGE_SIZE or so) consistent DMA regions, you should do:

	dma_addr_t dma_handle;

	cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, gfp);

where dev is a struct device *. This may be called in interrupt context with the GFP_ATOMIC flag.

Size is the length of the region you want to allocate, in bytes.

This routine will allocate RAM for that region, so it acts similarly to __get_free_pages (but takes size instead of a page order).
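Because dma_alloc_coherent() takes a byte count where __get_free_pages() takes a page order, it can help to see how the two relate. The user-space sketch below rounds a size up to the smallest order that covers it. The 4KB page size and the names size_to_order() and size_to_order_examples() are assumptions made for this illustration; in the kernel, the helper for this conversion is get_order().

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12			/* assumption: 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/*
 * Hypothetical helper: smallest order such that
 * (PAGE_SIZE << order) >= size. Assumes size > 0 and small enough
 * that the shift does not overflow.
 */
static unsigned int size_to_order(size_t size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Illustration only. */
static void size_to_order_examples(void)
{
	assert(size_to_order(1) == 0);		/* still one whole page */
	assert(size_to_order(4096) == 0);
	assert(size_to_order(4097) == 1);	/* spills into a 2nd page */
	assert(size_to_order(65536) == 4);	/* 16 pages = 64KB */
	assert(size_to_order(65537) == 5);
}
```

Note how even a one-byte request occupies a full page in this model, which is the reason small allocations are usually better served by a pool.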
If your driver needs regions sized smaller than a page, you may prefer using the dma_pool interface, described below. The consistent DMA mapping interfaces, for non-NULL dev, will by default return a DMA address which is 32-bit addressable. Even if the device indicates (via DMA mask) that it may address the upper 32-bits, consistent allocation will only return > 32-bit addresses for DMA if the consistent DMA mask has been explicitly changed via dma_set_coherent_mask(). This is true of the dma_pool interface as well. dma_alloc_coherent returns two values: the virtual address which you can use to access it from the CPU and dma_handle which you pass to the card. The cpu return address and the DMA bus master address are both guaranteed to be aligned to the smallest PAGE_SIZE order which is greater than or equal to the requested size. This invariant exists (for example) to guarantee that if you allocate a chunk which is smaller than or equal to 64 kilobytes, the extent of the buffer you receive will not cross a 64K boundary. To unmap and free such a DMA region, you call: dma_free_coherent(dev, size, cpu_addr, dma_handle); where dev, size are the same as in the above call and cpu_addr and dma_handle are the values dma_alloc_coherent returned to you. This function may not be called in interrupt context. If your driver needs lots of smaller memory regions, you can write custom code to subdivide pages returned by dma_alloc_coherent, or you can use the dma_pool API to do that. A dma_pool is like a kmem_cache, but it uses dma_alloc_coherent not __get_free_pages. Also, it understands common hardware constraints for alignment, like queue heads needing to be aligned on N byte boundaries. Create a dma_pool like this: struct dma_pool *pool; pool = dma_pool_create(name, dev, size, align, alloc); The "name" is for diagnostics (like a kmem_cache name); dev and size are as above. 
The device's hardware alignment requirement for this type of data is "align" (which is expressed in bytes, and must be a power of two). If your device has no boundary crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated from this pool must not cross 4KByte boundaries (but at that point it may be better to use dma_alloc_coherent directly instead).

Allocate memory from a dma pool like this:

	cpu_addr = dma_pool_alloc(pool, flags, &dma_handle);

flags are GFP_KERNEL if blocking is permitted (not in_interrupt nor holding SMP locks), GFP_ATOMIC otherwise. Like dma_alloc_coherent, this returns two values, cpu_addr and dma_handle.

Free memory that was allocated from a dma_pool like this:

	dma_pool_free(pool, cpu_addr, dma_handle);

where pool is what you passed to dma_pool_alloc, and cpu_addr and dma_handle are the values dma_pool_alloc returned. This function may be called in interrupt context.

Destroy a dma_pool by calling:

	dma_pool_destroy(pool);

Make sure you've called dma_pool_free for all memory allocated from a pool before you destroy the pool. This function may not be called in interrupt context.

DMA Direction

The interfaces described in subsequent portions of this document take a DMA direction argument, which is an integer and takes on one of the following values:

	DMA_BIDIRECTIONAL
	DMA_TO_DEVICE
	DMA_FROM_DEVICE
	DMA_NONE

You should provide the exact DMA direction if you know it.

DMA_TO_DEVICE means "from main memory to the device"
DMA_FROM_DEVICE means "from the device to main memory"

It is the direction in which the data moves during the DMA transfer.

You are _strongly_ encouraged to specify this as precisely as you possibly can.

If you absolutely cannot know the direction of the DMA transfer, specify DMA_BIDIRECTIONAL. It means that the DMA can go in either direction. The platform guarantees that you may legally specify this, and that it will work, but possibly at the cost of performance.
The value DMA_NONE is to be used for debugging. One can hold this in a data structure before you come to know the precise direction, and this will help catch cases where your direction tracking logic has failed to set things up properly. Another advantage of specifying this value precisely (outside of potential platform-specific optimizations of such) is for debugging. Some platforms actually have a write permission boolean which DMA mappings can be marked with, much like page protections in the user program address space. Such platforms can and do report errors in the kernel logs when the DMA controller hardware detects violation of the permission setting. Only streaming mappings specify a direction, consistent mappings implicitly have a direction attribute setting of DMA_BIDIRECTIONAL. The SCSI subsystem tells you the direction to use in the 'sc_data_direction' member of the SCSI command your driver is working on. For Networking drivers, it's a rather simple affair. For transmit packets, map/unmap them with the DMA_TO_DEVICE direction specifier. For receive packets, just the opposite, map/unmap them with the DMA_FROM_DEVICE direction specifier. Using Streaming DMA mappings The streaming DMA mapping routines can be called from interrupt context. There are two versions of each map/unmap, one which will map/unmap a single memory region, and one which will map/unmap a scatterlist. To map a single region, you do: struct device *dev = &my_dev->dev; dma_addr_t dma_handle; void *addr = buffer->ptr; size_t size = buffer->len; dma_handle = dma_map_single(dev, addr, size, direction); and to unmap it: dma_unmap_single(dev, dma_handle, size, direction); You should call dma_unmap_single when the DMA activity is finished, e.g. from the interrupt which told you that the DMA transfer is done. Using cpu pointers like this for single mappings has a disadvantage, you cannot reference HIGHMEM memory in this way. 
Thus, there is a map/unmap interface pair akin to dma_{map,unmap}_single. These interfaces deal with page/offset pairs instead of cpu pointers. Specifically: struct device *dev = &my_dev->dev; dma_addr_t dma_handle; struct page *page = buffer->page; unsigned long offset = buffer->offset; size_t size = buffer->len; dma_handle = dma_map_page(dev, page, offset, size, direction); ... dma_unmap_page(dev, dma_handle, size, direction); Here, "offset" means byte offset within the given page. With scatterlists, you map a region gathered from several regions by: int i, count = dma_map_sg(dev, sglist, nents, direction); struct scatterlist *sg; for_each_sg(sglist, sg, count, i) { hw_address[i] = sg_dma_address(sg); hw_len[i] = sg_dma_len(sg); } where nents is the number of entries in the sglist. The implementation is free to merge several consecutive sglist entries into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any consecutive sglist entries can be merged into one provided the first one ends and the second one starts on a page boundary - in fact this is a huge advantage for cards which either cannot do scatter-gather or have very limited number of scatter-gather entries) and returns the actual number of sg entries it mapped them to. On failure 0 is returned. Then you should loop count times (note: this can be less than nents times) and use sg_dma_address() and sg_dma_len() macros where you previously accessed sg->address and sg->length as shown above. To unmap a scatterlist, just call: dma_unmap_sg(dev, sglist, nents, direction); Again, make sure DMA activity has already finished. PLEASE NOTE: The 'nents' argument to the dma_unmap_sg call must be the _same_ one you passed into the dma_map_sg call, it should _NOT_ be the 'count' value _returned_ from the dma_map_sg call. 
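The difference between nents and the returned count can be illustrated with a toy model. The sketch below is plain user-space C and is not the kernel's implementation: struct toy_sg, toy_map_sg() and toy_sg_example() are made-up names, and the merge rule (join entries whose addresses are directly contiguous) is a simplifying assumption standing in for whatever a real DMA mapping implementation does.

```c
#include <assert.h>
#include <stdint.h>

struct toy_sg {
	uint64_t addr;		/* input: address of this entry */
	uint32_t len;		/* input: length of this entry */
	uint64_t dma_addr;	/* output: plays the role of sg_dma_address() */
	uint32_t dma_len;	/* output: plays the role of sg_dma_len() */
};

/*
 * Merge contiguous entries. The merged segments are stored in the
 * dma_* fields of the first 'count' entries, and count is returned.
 * The input fields (addr, len) of all nents entries stay untouched.
 */
static int toy_map_sg(struct toy_sg *sgl, int nents)
{
	int i, count = 0;

	for (i = 0; i < nents; i++) {
		struct toy_sg *prev = count ? &sgl[count - 1] : 0;

		if (prev && prev->dma_addr + prev->dma_len == sgl[i].addr) {
			prev->dma_len += sgl[i].len;	/* merge */
		} else {
			sgl[count].dma_addr = sgl[i].addr;
			sgl[count].dma_len = sgl[i].len;
			count++;
		}
	}
	return count;
}

/* Illustration only. */
static void toy_sg_example(void)
{
	struct toy_sg sgl[3] = {
		{ .addr = 0x1000, .len = 0x1000 },
		{ .addr = 0x2000, .len = 0x1000 },	/* follows entry 0 */
		{ .addr = 0x8000, .len = 0x0800 },
	};
	int nents = 3;
	int count = toy_map_sg(sgl, nents);

	assert(count == 2);			/* fewer than nents */
	assert(sgl[0].dma_addr == 0x1000 && sgl[0].dma_len == 0x2000);
	assert(sgl[1].dma_addr == 0x8000 && sgl[1].dma_len == 0x0800);
	/* An unmap would still be given nents (3), not count (2). */
}
```

In this model the hardware would be programmed with two segments, while the list itself still has three entries, which is why the unmap call needs the original nents.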
Every dma_map_{single,sg} call should have its dma_unmap_{single,sg} counterpart, because the bus address space is a shared resource (although in some ports the mapping is per each BUS so less devices contend for the same bus address space) and you could render the machine unusable by eating all bus addresses. If you need to use the same streaming DMA region multiple times and touch the data in between the DMA transfers, the buffer needs to be synced properly in order for the cpu and device to see the most uptodate and correct copy of the DMA buffer. So, firstly, just map it with dma_map_{single,sg}, and after each DMA transfer call either: dma_sync_single_for_cpu(dev, dma_handle, size, direction); or: dma_sync_sg_for_cpu(dev, sglist, nents, direction); as appropriate. Then, if you wish to let the device get at the DMA area again, finish accessing the data with the cpu, and then before actually giving the buffer to the hardware call either: dma_sync_single_for_device(dev, dma_handle, size, direction); or: dma_sync_sg_for_device(dev, sglist, nents, direction); as appropriate. After the last DMA transfer call one of the DMA unmap routines dma_unmap_{single,sg}. If you don't touch the data from the first dma_map_* call till dma_unmap_*, then you don't have to call the dma_sync_* routines at all. Here is pseudo code which shows a situation in which you would need to use the dma_sync_*() interfaces. my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len) { dma_addr_t mapping; mapping = dma_map_single(cp->dev, buffer, len, DMA_FROM_DEVICE); cp->rx_buf = buffer; cp->rx_len = len; cp->rx_dma = mapping; give_rx_buf_to_card(cp); } ... my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs) { struct my_card *cp = devid; ... if (read_card_status(cp) == RX_BUF_TRANSFERRED) { struct my_card_header *hp; /* Examine the header to see if we wish * to accept the data. 
But synchronize * the DMA transfer with the CPU first * so that we see updated contents. */ dma_sync_single_for_cpu(&cp->dev, cp->rx_dma, cp->rx_len, DMA_FROM_DEVICE); /* Now it is safe to examine the buffer. */ hp = (struct my_card_header *) cp->rx_buf; if (header_is_ok(hp)) { dma_unmap_single(&cp->dev, cp->rx_dma, cp->rx_len, DMA_FROM_DEVICE); pass_to_upper_layers(cp->rx_buf); make_and_setup_new_rx_buf(cp); } else { /* CPU should not write to * DMA_FROM_DEVICE-mapped area, * so dma_sync_single_for_device() is * not needed here. It would be required * for DMA_BIDIRECTIONAL mapping if * the memory was modified. */ give_rx_buf_to_card(cp); } } } Drivers converted fully to this interface should not use virt_to_bus any longer, nor should they use bus_to_virt. Some drivers have to be changed a little bit, because there is no longer an equivalent to bus_to_virt in the dynamic DMA mapping scheme - you have to always store the DMA addresses returned by the dma_alloc_coherent, dma_pool_alloc, and dma_map_single calls (dma_map_sg stores them in the scatterlist itself if the platform supports dynamic DMA mapping in hardware) in your driver structures and/or in the card registers. All drivers should be using these interfaces with no exceptions. It is planned to completely remove virt_to_bus() and bus_to_virt() as they are entirely deprecated. Some ports already do not provide these as it is impossible to correctly support them. Handling Errors DMA address space is limited on some architectures and an allocation failure can be determined by: - checking if dma_alloc_coherent returns NULL or dma_map_sg returns 0 - checking the returned dma_addr_t of dma_map_single and dma_map_page by using dma_mapping_error(): dma_addr_t dma_handle; dma_handle = dma_map_single(dev, addr, size, direction); if (dma_mapping_error(dev, dma_handle)) { /* * reduce current DMA mapping usage, * delay and try again later or * reset driver. 
*/ } Networking drivers must call dev_kfree_skb to free the socket buffer and return NETDEV_TX_OK if the DMA mapping fails on the transmit hook (ndo_start_xmit). This means that the socket buffer is just dropped in the failure case. SCSI drivers must return SCSI_MLQUEUE_HOST_BUSY if the DMA mapping fails in the queuecommand hook. This means that the SCSI subsystem passes the command to the driver again later. Optimizing Unmap State Space Consumption On many platforms, dma_unmap_{single,page}() is simply a nop. Therefore, keeping track of the mapping address and length is a waste of space. Instead of filling your drivers up with ifdefs and the like to "work around" this (which would defeat the whole purpose of a portable API) the following facilities are provided. Actually, instead of describing the macros one by one, we'll transform some example code. 1) Use DEFINE_DMA_UNMAP_{ADDR,LEN} in state saving structures. Example, before: struct ring_state { struct sk_buff *skb; dma_addr_t mapping; __u32 len; }; after: struct ring_state { struct sk_buff *skb; DEFINE_DMA_UNMAP_ADDR(mapping); DEFINE_DMA_UNMAP_LEN(len); }; 2) Use dma_unmap_{addr,len}_set to set these values. Example, before: ringp->mapping = FOO; ringp->len = BAR; after: dma_unmap_addr_set(ringp, mapping, FOO); dma_unmap_len_set(ringp, len, BAR); 3) Use dma_unmap_{addr,len} to access these values. Example, before: dma_unmap_single(dev, ringp->mapping, ringp->len, DMA_FROM_DEVICE); after: dma_unmap_single(dev, dma_unmap_addr(ringp, mapping), dma_unmap_len(ringp, len), DMA_FROM_DEVICE); It really should be self-explanatory. We treat the ADDR and LEN separately, because it is possible for an implementation to only need the address in order to perform the unmap operation. Platform Issues If you are just writing drivers for Linux and do not maintain an architecture port for the kernel, you can safely skip down to "Closing". 1) Struct scatterlist requirements. 
Don't invent the architecture specific struct scatterlist; just use <asm-generic/scatterlist.h>. You need to enable CONFIG_NEED_SG_DMA_LENGTH if the architecture supports IOMMUs (including software IOMMU).

2) ARCH_DMA_MINALIGN

Architectures must ensure that a kmalloc'ed buffer is DMA-safe. Drivers and subsystems depend on it. If an architecture isn't fully DMA-coherent (i.e. hardware doesn't ensure that data in the CPU cache is identical to data in main memory), ARCH_DMA_MINALIGN must be set so that the memory allocator makes sure that a kmalloc'ed buffer doesn't share a cache line with other buffers. See arch/arm/include/asm/cache.h as an example.

Note that ARCH_DMA_MINALIGN is about DMA memory alignment constraints. You don't need to worry about the architecture data alignment constraints (e.g. the alignment constraints about 64-bit objects).

3) Supporting multiple types of IOMMUs

If your architecture needs to support multiple types of IOMMUs, you can use include/asm-generic/dma-mapping-common.h. It's a library to support the DMA API with multiple types of IOMMUs. Lots of architectures (x86, powerpc, sh, alpha, ia64, microblaze and sparc) use it. Choose one to see how it can be used. If you need to support multiple types of IOMMUs in a single system, the x86 or powerpc code is a helpful example.

Closing

This document, and the API itself, would not be in its current form without the feedback and suggestions from numerous individuals. We would like to specifically mention, in no particular order, the following people:

	Russell King
	Leo Dagum
	Ralf Baechle
	Grant Grundler
	Jay Estabrook
	Thomas Sailer
	Andrea Arcangeli
	Jens Axboe
	David Mosberger-Tang

DMA attributes
==============

This document describes the semantics of the DMA attributes that are defined in linux/dma-attrs.h.

DMA_ATTR_WRITE_BARRIER
----------------------

DMA_ATTR_WRITE_BARRIER is a (write) barrier attribute for DMA.
DMA to a memory region with the DMA_ATTR_WRITE_BARRIER attribute forces all pending DMA writes to complete, and thus provides a mechanism to strictly order DMA from a device across all intervening busses and bridges. This barrier is not specific to a particular type of interconnect; it applies to the system as a whole, and so its implementation must account for the idiosyncrasies of the system all the way from the DMA device to memory.

As an example of a situation where DMA_ATTR_WRITE_BARRIER would be useful, suppose that a device does a DMA write to indicate that data is ready and available in memory. The DMA of the "completion indication" could race with data DMA. Mapping the memory used for completion indications with DMA_ATTR_WRITE_BARRIER would prevent the race.

DMA_ATTR_WEAK_ORDERING
----------------------

DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping may be weakly ordered, that is, that reads and writes may pass each other.

Since it is optional for platforms to implement DMA_ATTR_WEAK_ORDERING, those that do not will simply ignore the attribute and exhibit default behavior.

DMA Buffer Sharing API Guide
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sumit Semwal

This document serves as a guide to device-driver writers on what the dma-buf buffer sharing API is and how to use it for exporting and using shared buffers.

Any device driver which wishes to be a part of DMA buffer sharing can do so as either the 'exporter' of buffers, or the 'user' of buffers.

Say a driver A wants to use buffers created by driver B; we then call B the exporter, and A the buffer-user.
The exporter - implements and manages operations[1] for the buffer - allows other users to share the buffer by using dma_buf sharing APIs, - manages the details of buffer allocation, - decides about the actual backing storage where this allocation happens, - takes care of any migration of scatterlist - for all (shared) users of this buffer, The buffer-user - is one of (many) sharing users of the buffer. - doesn't need to worry about how the buffer is allocated, or where. - needs a mechanism to get access to the scatterlist that makes up this buffer in memory, mapped into its own address space, so it can access the same area of memory. *IMPORTANT*: [see https://lkml.org/lkml/2011/12/20/211 for more details] For this first version, A buffer shared using the dma_buf sharing API: - *may* be exported to user space using "mmap" *ONLY* by exporter, outside of this framework. - may be used *ONLY* by importers that do not need CPU access to the buffer. The dma_buf buffer sharing API usage contains the following steps: 1. Exporter announces that it wishes to export a buffer 2. Userspace gets the file descriptor associated with the exported buffer, and passes it around to potential buffer-users based on use case 3. Each buffer-user 'connects' itself to the buffer 4. When needed, buffer-user requests access to the buffer from exporter 5. When finished with its use, the buffer-user notifies end-of-DMA to exporter 6. when buffer-user is done using this buffer completely, it 'disconnects' itself from the buffer. 1. Exporter's announcement of buffer export The buffer exporter announces its wish to export a buffer. In this, it connects its own private buffer data, provides implementation for operations that can be performed on the exported dma_buf, and flags for the file associated with this buffer. 
Interface: struct dma_buf *dma_buf_export(void *priv, struct dma_buf_ops *ops, size_t size, int flags) If this succeeds, dma_buf_export allocates a dma_buf structure, and returns a pointer to the same. It also associates an anonymous file with this buffer, so it can be exported. On failure to allocate the dma_buf object, it returns NULL. 2. Userspace gets a handle to pass around to potential buffer-users Userspace entity requests for a file-descriptor (fd) which is a handle to the anonymous file associated with the buffer. It can then share the fd with other drivers and/or processes. Interface: int dma_buf_fd(struct dma_buf *dmabuf) This API installs an fd for the anonymous file associated with this buffer; returns either 'fd', or error. 3. Each buffer-user 'connects' itself to the buffer Each buffer-user now gets a reference to the buffer, using the fd passed to it. Interface: struct dma_buf *dma_buf_get(int fd) This API will return a reference to the dma_buf, and increment refcount for it. After this, the buffer-user needs to attach its device with the buffer, which helps the exporter to know of device buffer constraints. Interface: struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, struct device *dev) This API returns reference to an attachment structure, which is then used for scatterlist operations. It will optionally call the 'attach' dma_buf operation, if provided by the exporter. The dma-buf sharing framework does the bookkeeping bits related to managing the list of all attachments to a buffer. Until this stage, the buffer-exporter has the option to choose not to actually allocate the backing storage for this buffer, but wait for the first buffer-user to request use of buffer for allocation. 4. When needed, buffer-user requests access to the buffer Whenever a buffer-user wants to use the buffer for any DMA, it asks for access to the buffer using dma_buf_map_attachment API. 
At least one attach to the buffer must have happened before map_dma_buf can be called. Interface: struct sg_table * dma_buf_map_attachment(struct dma_buf_attachment *, enum dma_data_direction); This is a wrapper to dma_buf->ops->map_dma_buf operation, which hides the "dma_buf->ops->" indirection from the users of this interface. In struct dma_buf_ops, map_dma_buf is defined as struct sg_table * (*map_dma_buf)(struct dma_buf_attachment *, enum dma_data_direction); It is one of the buffer operations that must be implemented by the exporter. It should return the sg_table containing scatterlist for this buffer, mapped into caller's address space. If this is being called for the first time, the exporter can now choose to scan through the list of attachments for this buffer, collate the requirements of the attached devices, and choose an appropriate backing storage for the buffer. Based on enum dma_data_direction, it might be possible to have multiple users accessing at the same time (for reading, maybe), or any other kind of sharing that the exporter might wish to make available to buffer-users. map_dma_buf() operation can return -EINTR if it is interrupted by a signal. 5. When finished, the buffer-user notifies end-of-DMA to exporter Once the DMA for the current buffer-user is over, it signals 'end-of-DMA' to the exporter using the dma_buf_unmap_attachment API. Interface: void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *); This is a wrapper to dma_buf->ops->unmap_dma_buf() operation, which hides the "dma_buf->ops->" indirection from the users of this interface. In struct dma_buf_ops, unmap_dma_buf is defined as void (*unmap_dma_buf)(struct dma_buf_attachment *, struct sg_table *); unmap_dma_buf signifies the end-of-DMA for the attachment provided. Like map_dma_buf, this API also must be implemented by the exporter. 6. when buffer-user is done using this buffer, it 'disconnects' itself from the buffer. 
After the buffer-user has no more interest in using this buffer, it should disconnect itself from the buffer: - it first detaches itself from the buffer. Interface: void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *dmabuf_attach); This API removes the attachment from the list in dmabuf, and optionally calls dma_buf->ops->detach(), if provided by exporter, for any housekeeping bits. - Then, the buffer-user returns the buffer reference to exporter. Interface: void dma_buf_put(struct dma_buf *dmabuf); This API then reduces the refcount for this buffer. If, as a result of this call, the refcount becomes 0, the 'release' file operation related to this fd is called. It calls the dmabuf->ops->release() operation in turn, and frees the memory allocated for dmabuf when exported. NOTES: - Importance of attach-detach and {map,unmap}_dma_buf operation pairs The attach-detach calls allow the exporter to figure out backing-storage constraints for the currently-interested devices. This allows preferential allocation, and/or migration of pages across different types of storage available, if possible. Bracketing of DMA access with {map,unmap}_dma_buf operations is essential to allow just-in-time backing of storage, and migration mid-way through a use-case. - Migration of backing storage if needed If after - at least one map_dma_buf has happened, - and the backing storage has been allocated for this buffer, another new buffer-user intends to attach itself to this buffer, it might be allowed, if possible for the exporter. In case it is allowed by the exporter: if the new buffer-user has stricter 'backing-storage constraints', and the exporter can handle these constraints, the exporter can just stall on the map_dma_buf until all outstanding access is completed (as signalled by unmap_dma_buf). 
Once all users have finished accessing and have unmapped this buffer, the exporter could potentially move the buffer to the stricter backing-storage, and then allow further {map,unmap}_dma_buf operations from any buffer-user from the migrated backing-storage.

If the exporter cannot fulfil the backing-storage constraints of the new buffer-user device as requested, dma_buf_attach() would return an error to denote non-compatibility of the new buffer-sharing request with the current buffer.

If the exporter chooses not to allow an attach() operation once a map_dma_buf() API has been called, it simply returns an error.

Miscellaneous notes:
- Any exporters or users of the dma-buf buffer sharing framework must have a 'select DMA_SHARED_BUFFER' in their respective Kconfigs.

References:
[1] struct dma_buf_ops in include/linux/dma-buf.h
[2] All interfaces mentioned above defined in include/linux/dma-buf.h

DMA Engine API Guide
====================

Vinod Koul

NOTE: For DMA Engine usage in async_tx please see: Documentation/crypto/async-tx-api.txt

Below is a guide to device driver writers on how to use the Slave-DMA API of the DMA Engine. This is applicable to slave DMA usage only.

The slave DMA usage consists of the following steps:
1. Allocate a DMA slave channel
2. Set slave and controller specific parameters
3. Get a descriptor for transaction
4. Submit the transaction
5. Issue pending requests and wait for callback notification

1. Allocate a DMA slave channel

Channel allocation is slightly different in the slave DMA context: client drivers typically need a channel from a particular DMA controller only, and in some cases even a specific channel is desired. To request a channel, the dma_request_channel() API is used.
Interface:
	struct dma_chan *dma_request_channel(dma_cap_mask_t mask,
			dma_filter_fn filter_fn,
			void *filter_param);
where dma_filter_fn is defined as:
	typedef bool (*dma_filter_fn)(struct dma_chan *chan, void *filter_param);

The 'filter_fn' parameter is optional, but highly recommended for slave and cyclic channels as they typically need to obtain a specific DMA channel.

When the optional 'filter_fn' parameter is NULL, dma_request_channel() simply returns the first channel that satisfies the capability mask.

Otherwise, the 'filter_fn' routine will be called once for each free channel which has a capability in 'mask'. 'filter_fn' is expected to return 'true' when the desired DMA channel is found.

A channel allocated via this interface is exclusive to the caller, until dma_release_channel() is called.

2. Set slave and controller specific parameters

The next step is always to pass some specific information to the DMA driver. Most of the generic information which a slave DMA can use is in struct dma_slave_config. This allows the clients to specify DMA direction, DMA addresses, bus widths, DMA burst lengths etc. for the peripheral.

If some DMA controllers have more parameters to be sent, then they should try to embed struct dma_slave_config in their controller specific structure. That gives the client the flexibility to pass more parameters, if required.

Interface:
	int dmaengine_slave_config(struct dma_chan *chan,
			struct dma_slave_config *config)

Please see the dma_slave_config structure definition in dmaengine.h for a detailed explanation of the struct members. Please note that the 'direction' member will be going away as it duplicates the direction given in the prepare call.

3. Get a descriptor for transaction

For slave usage the various modes of slave transfers supported by the DMA-engine are:

slave_sg - DMA a list of scatter gather buffers from/to a peripheral
dma_cyclic - Perform a cyclic DMA operation from/to a peripheral till the operation is explicitly stopped.
interleaved_dma - This is common to Slave as well as M2M clients. For slave address of devices' fifo could be already known to the driver. Various types of operations could be expressed by setting appropriate values to the 'dma_interleaved_template' members. A non-NULL return of this transfer API represents a "descriptor" for the given transaction. Interface: struct dma_async_tx_descriptor *(*chan->device->device_prep_slave_sg)( struct dma_chan *chan, struct scatterlist *sgl, unsigned int sg_len, enum dma_data_direction direction, unsigned long flags); struct dma_async_tx_descriptor *(*chan->device->device_prep_dma_cyclic)( struct dma_chan *chan, dma_addr_t buf_addr, size_t buf_len, size_t period_len, enum dma_data_direction direction); struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)( struct dma_chan *chan, struct dma_interleaved_template *xt, unsigned long flags); The peripheral driver is expected to have mapped the scatterlist for the DMA operation prior to calling device_prep_slave_sg, and must keep the scatterlist mapped until the DMA operation has completed. The scatterlist must be mapped using the DMA struct device. So, normal setup should look like this: nr_sg = dma_map_sg(chan->device->dev, sgl, sg_len); if (nr_sg == 0) /* error */ desc = chan->device->device_prep_slave_sg(chan, sgl, nr_sg, direction, flags); Once a descriptor has been obtained, the callback information can be added and the descriptor must then be submitted. Some DMA engine drivers may hold a spinlock between a successful preparation and submission so it is important that these two operations are closely paired. Note: Although the async_tx API specifies that completion callback routines cannot submit any new operations, this is not the case for slave/cyclic DMA. For slave DMA, the subsequent transaction may not be available for submission prior to callback function being invoked, so slave DMA callbacks are permitted to prepare and submit a new transaction. 
For cyclic DMA, a callback function may wish to terminate the DMA via dmaengine_terminate_all().

Therefore, it is important that DMA engine drivers drop any locks before calling the callback function, since holding them across the callback may cause a deadlock.

Note that callbacks will always be invoked from the DMA engine's tasklet, never from interrupt context.

4. Submit the transaction

Once the descriptor has been prepared and the callback information added, it must be placed on the DMA engine driver's pending queue.

Interface:
	dma_cookie_t dmaengine_submit(struct dma_async_tx_descriptor *desc)

This returns a cookie that can be used to check the progress of DMA engine activity via other DMA engine calls not covered in this document.

dmaengine_submit() will not start the DMA operation, it merely adds it to the pending queue. For this, see step 5, dma_async_issue_pending.

5. Issue pending DMA requests and wait for callback notification

The transactions in the pending queue can be activated by calling the issue_pending API. If the channel is idle then the first transaction in the queue is started and subsequent ones are queued up.

On completion of each DMA operation, the next in queue is started and a tasklet triggered. The tasklet will then call the client driver completion callback routine for notification, if set.

Interface:
	void dma_async_issue_pending(struct dma_chan *chan);

Further APIs:

1. int dmaengine_terminate_all(struct dma_chan *chan)

   This causes all activity for the DMA channel to be stopped, and may discard data in the DMA FIFO which hasn't been fully transferred. No callback functions will be called for any incomplete transfers.

2. int dmaengine_pause(struct dma_chan *chan)

   This pauses activity on the DMA channel without data loss.

3. int dmaengine_resume(struct dma_chan *chan)

   Resume a previously paused DMA channel. It is invalid to resume a channel which is not currently paused.

4.
enum dma_status dma_async_is_tx_complete(struct dma_chan *chan,
		dma_cookie_t cookie, dma_cookie_t *last, dma_cookie_t *used)

   This can be used to check the status of the channel. Please see the documentation in include/linux/dmaengine.h for a more complete description of this API.

   This can be used in conjunction with dma_async_is_complete() and the cookie returned from 'descriptor->submit()' to check for completion of a specific DMA transaction.

   Note: Not all DMA engine drivers can return reliable information for a running DMA channel. It is recommended that DMA engine users pause or stop (via dmaengine_terminate_all) the channel before using this API.

DMA with ISA and LPC devices
============================

Pierre Ossman

This document describes how to do DMA transfers using the old ISA DMA controller. Even though ISA is more or less dead today, the LPC bus uses the same DMA system so it will be around for quite some time.

Part I - Headers and dependencies
---------------------------------

To do ISA style DMA you need to include two headers:

	#include <linux/dma-mapping.h>
	#include <asm/dma.h>

The first is the generic DMA API used to convert virtual addresses to physical addresses (see Documentation/DMA-API.txt for details).

The second contains the routines specific to ISA DMA transfers. Since this is not present on all platforms, make sure you construct your Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries to build your driver on unsupported platforms.

Part II - Buffer allocation
---------------------------

The ISA DMA controller has some very strict requirements on which memory it can access, so extra care must be taken when allocating buffers.

(You usually need a special buffer for DMA transfers instead of transferring directly to and from your normal data structures.)

The DMA-able address space is the lowest 16 MB of _physical_ memory. Also the transfer block may not cross page boundaries (which are 64 or 128 KiB depending on which channel you use).
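The two constraints above (below 16 MB, no DMA page crossing) can be written down as a small check. This is only a sketch: isa_dma_range_ok() is a hypothetical helper, not a kernel API, and the DMA page size is passed in by the caller because it depends on the channel (as far as we know, 64 KiB for the 8-bit channels and 128 KiB for the 16-bit ones).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ISA DMA can only reach the lowest 16 MB of physical memory. */
#define ISA_DMA_LIMIT	(16u * 1024u * 1024u)

/*
 * Hypothetical helper (not part of the kernel): returns true if the
 * physical range [phys, phys + len) lies entirely below 16 MB and does
 * not cross a DMA page boundary of 'page_size' bytes.
 */
static bool isa_dma_range_ok(uint64_t phys, size_t len, uint32_t page_size)
{
	uint64_t last;

	if (len == 0 || page_size == 0 || len > ISA_DMA_LIMIT)
		return false;

	last = phys + len - 1;		/* address of the last byte */
	if (last >= ISA_DMA_LIMIT)
		return false;

	/* First and last byte must fall into the same DMA page. */
	return (phys / page_size) == (last / page_size);
}
```

A driver could run such a check on a freshly allocated buffer and retry or fail the probe if it does not hold; a buffer that is ISA-DMA-safe for a 16-bit channel is not necessarily safe for an 8-bit one.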
In order to allocate a piece of memory that satisfies all these requirements you pass the flag GFP_DMA to kmalloc.

Unfortunately the memory available for ISA DMA is scarce, so unless you allocate the memory during boot-up it's a good idea to also pass __GFP_REPEAT and __GFP_NOWARN to make the allocator try a bit harder.

(This scarcity also means that you should allocate the buffer as early as possible and not release it until the driver is unloaded.)

Part III - Address translation
------------------------------

To translate the virtual address to a physical address, use the normal DMA API. Do _not_ use isa_virt_to_phys() even though it does the same thing. The reason for this is that the function isa_virt_to_phys() will require a Kconfig dependency on ISA, not just ISA_DMA_API which is really all you need. Remember that even though the DMA controller has its origins in ISA it is used elsewhere.

Note: x86_64 had a broken DMA API when it came to ISA but has since been fixed. If your arch has problems then fix the DMA API instead of reverting to the ISA functions.

Part IV - Channels
------------------

A normal ISA DMA controller has 8 channels. The lower four are for 8-bit transfers and the upper four are for 16-bit transfers.

(Actually the DMA controller is really two separate controllers where channel 4 is used to give DMA access for the second controller (0-3). This means that of the four 16-bit channels only three are usable.)

You allocate these in a similar fashion as all basic resources:

	extern int request_dma(unsigned int dmanr, const char * device_id);
	extern void free_dma(unsigned int dmanr);

The ability to use 16-bit or 8-bit transfers is _not_ up to you as a driver author but depends on what the hardware supports. Check your specs or test different channels.

Part V - Transfer data
----------------------

Now for the good stuff, the actual DMA transfer. :)

Before you use any ISA DMA routines you need to claim the DMA lock using claim_dma_lock().
The reason is that some DMA operations are not atomic so only one driver may fiddle with the registers at a time.

The first time you use the DMA controller you should call clear_dma_ff(). This clears an internal register in the DMA controller that is used for the non-atomic operations. As long as you (and everyone else) use the locking functions then you only need to reset this once.

Next, you tell the controller in which direction you intend to do the transfer using set_dma_mode(). Currently you have the options DMA_MODE_READ and DMA_MODE_WRITE.

Set the address from where the transfer should start (this needs to be 16-bit aligned for 16-bit transfers) and how many bytes to transfer. Note that it's _bytes_. The DMA routines will do all the required translation to values that the DMA controller understands.

The final step is enabling the DMA channel and releasing the DMA lock.

Once the DMA transfer is finished (or timed out) you should disable the channel again. You should also check get_dma_residue() to make sure that all data has been transferred.

Example:

	unsigned long flags;
	int residue;

	/* Program the controller while holding the DMA lock. */
	flags = claim_dma_lock();

	clear_dma_ff(channel);

	set_dma_mode(channel, DMA_MODE_WRITE);
	set_dma_addr(channel, phys_addr);
	set_dma_count(channel, num_bytes);

	enable_dma(channel);

	release_dma_lock(flags);

	/* device_done() stands for your driver's own completion check. */
	while (!device_done());

	flags = claim_dma_lock();

	disable_dma(channel);

	residue = get_dma_residue(channel);
	if (residue != 0)
		printk(KERN_ERR "driver: Incomplete DMA transfer!"
			" %d bytes left!\n", residue);

	release_dma_lock(flags);

Part VI - Suspend/resume
------------------------

It is the driver's responsibility to make sure that the machine isn't suspended while a DMA transfer is in progress. Also, all DMA settings are lost when the system suspends, so if your driver relies on the DMA controller being in a certain state then you have to restore these registers upon resume.

Introduction
============

This document describes how to use the dynamic debug (ddebug) feature.
Dynamic debug is designed to allow you to dynamically enable/disable kernel code to obtain additional kernel information. Currently, if CONFIG_DYNAMIC_DEBUG is set, then all pr_debug()/dev_dbg() calls can be dynamically enabled per-callsite. Dynamic debug has even more useful features: * Simple query language allows turning on and off debugging statements by matching any combination of: - source filename - function name - line number (including ranges of line numbers) - module name - format string * Provides a debugfs control file: /dynamic_debug/control which can be read to display the complete list of known debug statements, to help guide you Controlling dynamic debug Behaviour =================================== The behaviour of pr_debug()/dev_dbg()s are controlled via writing to a control file in the 'debugfs' filesystem. Thus, you must first mount the debugfs filesystem, in order to make use of this feature. Subsequently, we refer to the control file as: /dynamic_debug/control. For example, if you want to enable printing from source file 'svcsock.c', line 1603 you simply do: nullarbor:~ # echo 'file svcsock.c line 1603 +p' > /dynamic_debug/control If you make a mistake with the syntax, the write will fail thus: nullarbor:~ # echo 'file svcsock.c wtf 1 +p' > /dynamic_debug/control -bash: echo: write error: Invalid argument Viewing Dynamic Debug Behaviour =========================== You can view the currently configured behaviour of all the debug statements via: nullarbor:~ # cat /dynamic_debug/control # filename:lineno [module]function flags format /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:323 [svcxprt_rdma]svc_rdma_cleanup - "SVCRDMA Module Removed, deregister RPC RDMA transport\012" /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:341 [svcxprt_rdma]svc_rdma_init - "\011max_inline : %d\012" /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:340 [svcxprt_rdma]svc_rdma_init - 
"\011sq_depth : %d\012" /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svc_rdma.c:338 [svcxprt_rdma]svc_rdma_init - "\011max_requests : %d\012" ... You can also apply standard Unix text manipulation filters to this data, e.g. nullarbor:~ # grep -i rdma /dynamic_debug/control | wc -l 62 nullarbor:~ # grep -i tcp /dynamic_debug/control | wc -l 42 Note in particular that the third column shows the enabled behaviour flags for each debug statement callsite (see below for definitions of the flags). The default value, no extra behaviour enabled, is "-". So you can view all the debug statement callsites with any non-default flags: nullarbor:~ # awk '$3 != "-"' /dynamic_debug/control # filename:lineno [module]function flags format /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svcsock.c:1603 [sunrpc]svc_send p "svc_process: st_sendto returned %d\012" Command Language Reference ========================== At the lexical level, a command comprises a sequence of words separated by whitespace characters. Note that newlines are treated as word separators and do *not* end a command or allow multiple commands to be done together. So these are all equivalent: nullarbor:~ # echo -c 'file svcsock.c line 1603 +p' > /dynamic_debug/control nullarbor:~ # echo -c ' file svcsock.c line 1603 +p ' > /dynamic_debug/control nullarbor:~ # echo -c 'file svcsock.c\nline 1603 +p' > /dynamic_debug/control nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > /dynamic_debug/control Commands are bounded by a write() system call. 
If you want to do multiple commands you need to do a separate "echo" for each, like:

nullarbor:~ # echo 'file svcsock.c line 1603 +p' > /dynamic_debug/control ;\
> echo 'file svcsock.c line 1563 +p' > /dynamic_debug/control

or even like:

nullarbor:~ # (
> echo 'file svcsock.c line 1603 +p' ;\
> echo 'file svcsock.c line 1563 +p' ;\
> ) > /dynamic_debug/control

At the syntactical level, a command comprises a sequence of match specifications, followed by a flags change specification.

command ::= match-spec* flags-spec

The match-specs are used to choose a subset of the known dprintk() callsites to which to apply the flags-spec. Think of them as a query with implicit ANDs between each pair. Note that an empty list of match-specs is possible, but is not very useful because it will not match any debug statement callsites.

A match specification comprises a keyword, which controls the attribute of the callsite to be compared, and a value to compare against. Possible keywords are:

match-spec ::= 'func' string |
	       'file' string |
	       'module' string |
	       'format' string |
	       'line' line-range

line-range ::= lineno |
	       '-'lineno |
	       lineno'-' |
	       lineno'-'lineno
// Note: line-range cannot contain space, e.g.
// "1-30" is valid range but "1 - 30" is not.

lineno ::= unsigned-int

The meanings of each keyword are:

func
    The given string is compared against the function name of each callsite. Example:

	func svc_tcp_accept

file
    The given string is compared against either the full pathname or the basename of the source file of each callsite. Examples:

	file svcsock.c
	file /usr/src/packages/BUILD/sgi-enhancednfs-1.4/default/net/sunrpc/svcsock.c

module
    The given string is compared against the module name of each callsite. The module name is the string as seen in "lsmod", i.e. without the directory or the .ko suffix and with '-' changed to '_'. Examples:

	module sunrpc
	module nfsd

format
    The given string is searched for in the dynamic debug format string. Note that the string does not need to match the entire format, only some part.
    Whitespace and other special characters can be escaped using C octal character escape \ooo notation, e.g. the space character is \040. Alternatively, the string can be enclosed in double quote characters (") or single quote characters ('). Examples:

	format svcrdma:         // many of the NFS/RDMA server dprintks
	format readahead        // some dprintks in the readahead cache
	format nfsd:\040SETATTR // one way to match a format with whitespace
	format "nfsd: SETATTR"  // a neater way to match a format with whitespace
	format 'nfsd: SETATTR'  // yet another way to match a format with whitespace

line
    The given line number or range of line numbers is compared against the line number of each dprintk() callsite. A single line number matches the callsite line number exactly. A range of line numbers matches any callsite between the first and last line number inclusive. An empty first number means the first line in the file, an empty last number means the last line in the file. Examples:

	line 1603       // exactly line 1603
	line 1600-1605  // the six lines from line 1600 to line 1605
	line -1605      // the 1605 lines from line 1 to line 1605
	line 1600-      // all lines from line 1600 to the end of the file

The flags specification comprises a change operation followed by one or more flag characters. The change operation is one of the characters:

-    remove the given flags
+    add the given flags
=    set the flags to the given flags

The flags are:

f    Include the function name in the printed message
l    Include line number in the printed message
m    Include module name in the printed message
p    Causes a printk() message to be emitted to dmesg
t    Include thread ID in messages not generated from interrupt context

Note the regexp ^[-+=][flmpt]+$ matches a flags specification. Note also that there is no convenient syntax to remove all the flags at once, you need to use "-flmpt".
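To make the grammar above concrete, here is a minimal userspace sketch of a checker for flags-spec (the regexp ^[-+=][flmpt]+$) and a parser for line-range. It is not the kernel's parser; the names ddebug_flags_spec_ok(), ddebug_parse_line_range() and struct line_range are made up for this illustration. Representing an open upper bound as UINT_MAX and rejecting reversed ranges are assumptions of the sketch.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: an inclusive range of line numbers. */
struct line_range {
	unsigned int first;
	unsigned int last;	/* UINT_MAX stands for "end of file" */
};

/* Returns true if 's' matches ^[-+=][flmpt]+$ */
static bool ddebug_flags_spec_ok(const char *s)
{
	size_t n;

	if (s == NULL || s[0] == '\0' || strchr("-+=", s[0]) == NULL)
		return false;
	n = strspn(s + 1, "flmpt");	/* length of the run of flag characters */
	return n > 0 && s[1 + n] == '\0';
}

/* Parses one unsigned decimal number at *p and advances *p past it. */
static bool parse_lineno(const char **p, unsigned int *out)
{
	char *end;
	unsigned long v;

	if (!isdigit((unsigned char)**p))
		return false;
	errno = 0;
	v = strtoul(*p, &end, 10);
	if (errno == ERANGE || v > UINT_MAX)
		return false;
	*out = (unsigned int)v;
	*p = end;
	return true;
}

/* Accepts: lineno | '-'lineno | lineno'-' | lineno'-'lineno (no spaces). */
static bool ddebug_parse_line_range(const char *s, struct line_range *r)
{
	const char *p = s;

	r->first = 1;		/* empty first number: first line of the file */
	r->last = UINT_MAX;	/* empty last number: last line of the file */

	if (*p != '-') {
		if (!parse_lineno(&p, &r->first))
			return false;
		if (*p == '\0') {	/* a single line number */
			r->last = r->first;
			return true;
		}
		if (*p != '-')
			return false;
	}
	p++;				/* skip the '-' */
	if (*p == '\0')
		return s[0] != '-';	/* "N-" is valid, a bare "-" is not */
	if (!parse_lineno(&p, &r->last) || *p != '\0')
		return false;
	/* Assumption of this sketch: reject reversed ranges such as "30-1". */
	return r->first <= r->last;
}
```

With these helpers, "1600-1605", "-1605", "1600-" and "1603" parse as the four examples listed for the line keyword, while "1 - 30" is rejected because of the embedded spaces.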
Debug messages during boot process ================================== To be able to activate debug messages during the boot process, even before userspace and debugfs exists, use the boot parameter: ddebug_query="QUERY" QUERY follows the syntax described above, but must not exceed 1023 characters. The enablement of debug messages is done as an arch_initcall. Thus you can enable debug messages in all code processed after this arch_initcall via this boot parameter. On an x86 system for example ACPI enablement is a subsys_initcall and ddebug_query="file ec.c +p" will show early Embedded Controller transactions during ACPI setup if your machine (typically a laptop) has an Embedded Controller. PCI (or other devices) initialization also is a hot candidate for using this boot parameter for debugging purposes. Examples ======== // enable the message at line 1603 of file svcsock.c nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > /dynamic_debug/control // enable all the messages in file svcsock.c nullarbor:~ # echo -n 'file svcsock.c +p' > /dynamic_debug/control // enable all the messages in the NFS server module nullarbor:~ # echo -n 'module nfsd +p' > /dynamic_debug/control // enable all 12 messages in the function svc_process() nullarbor:~ # echo -n 'func svc_process +p' > /dynamic_debug/control // disable all 12 messages in the function svc_process() nullarbor:~ # echo -n 'func svc_process -p' > /dynamic_debug/control // enable messages for NFS calls READ, READLINK, READDIR and READDIR+. 
nullarbor:~ # echo -n 'format "nfsd: READ" +p' > /dynamic_debug/control EDAC - Error Detection And Correction Written by Doug Thompson 7 Dec 2005 17 Jul 2007 Updated (c) Mauro Carvalho Chehab 05 Aug 2009 Nehalem interface EDAC is maintained and written by: Doug Thompson, Dave Jiang, Dave Peterson et al, original author: Thayne Harbaugh, Contact: website: bluesmoke.sourceforge.net mailing list: bluesmoke-devel@lists.sourceforge.net "bluesmoke" was the name for this device driver when it was "out-of-tree" and maintained at sourceforge.net. When it was pushed into 2.6.16 for the first time, it was renamed to 'EDAC'. The bluesmoke project at sourceforge.net is now utilized as a 'staging area' for EDAC development, before it is sent upstream to kernel.org At the bluesmoke/EDAC project site is a series of quilt patches against recent kernels, stored in a SVN repository. For easier downloading, there is also a tarball snapshot available. ============================================================================ EDAC PURPOSE The 'edac' kernel module goal is to detect and report errors that occur within the computer system running under linux. MEMORY In the initial release, memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the primary errors being harvested. These types of errors are harvested by the 'edac_mc' class of device. Detecting CE events, then harvesting those events and reporting them, CAN be a predictor of future UE events. With CE events, the system can continue to operate, but with less safety. Preventive maintenance and proactive part replacement of memory DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE events and system 'panics'. NON-MEMORY A new feature for EDAC, the edac_device class of device, was added in the 2.6.23 version of the kernel. This new device type allows for non-memory type of ECC hardware detectors to have their states harvested and presented to userspace via the sysfs interface. 
Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA engines, fabric switches, main data path switches, interconnections, and various other hardware data paths. If the hardware reports it, then an edac_device device probably can be constructed to harvest and present that to userspace.

PCI BUS SCANNING

In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices in order to determine if errors are occurring on data transfers.

The presence of PCI Parity errors must be examined with a grain of salt. There are several add-in adapters that do NOT follow the PCI specification with regards to Parity generation and reporting. The specification says the vendor should tie the parity status bits to 0 if they do not intend to generate parity. Some vendors do not do this, and thus the parity bit can "float" giving false positives.

In the kernel there is a PCI device attribute located in sysfs that is checked by the EDAC PCI scanning code. If that attribute is set, PCI parity/error scanning is skipped for that device. The attribute is:

	broken_parity_status

and is located in /sys/devices/pci/0000:XX:YY.Z directories for PCI devices.

FUTURE HARDWARE SCANNING

EDAC will have future error detectors that will be integrated with EDAC or added to it, in the following list:

	MCE	Machine Check Exception
	MCA	Machine Check Architecture
	NMI	NMI notification of ECC errors
	MSRs	Machine Specific Register error cases
	and other mechanisms.

These errors are usually bus errors, ECC errors, thermal throttling and the like.

============================================================================
EDAC VERSIONING

EDAC is composed of a "core" module (edac_core.ko) and several Memory Controller (MC) driver modules. On a given system, the CORE is loaded and one MC driver will be loaded. Both the CORE and the MC driver (or edac_device driver) have individual versions that reflect current release level of their respective modules.
Thus, to "report" on what version a system is running, one must report both the CORE's and the MC driver's versions. LOADING If 'edac' was statically linked with the kernel then no loading is necessary. If 'edac' was built as modules then simply modprobe the 'edac' pieces that you need. You should be able to modprobe hardware-specific modules and have the dependencies load the necessary core modules. Example: $> modprobe amd76x_edac loads both the amd76x_edac.ko memory controller module and the edac_mc.ko core module. ============================================================================ EDAC sysfs INTERFACE EDAC presents a 'sysfs' interface for control, reporting and attribute reporting purposes. EDAC lives in the /sys/devices/system/edac directory. Within this directory there currently reside 2 'edac' components: mc memory controller(s) system pci PCI control and status system ============================================================================ Memory Controller (mc) Model First a background on the memory controller's model abstracted in EDAC. Each 'mc' device controls a set of DIMM memory modules. These modules are laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can be multiple csrows and multiple channels. Memory controllers allow for several csrows, with 8 csrows being a typical value. Yet, the actual number of csrows depends on the electrical "loading" of a given motherboard, memory controller and DIMM characteristics. Dual channels allows for 128 bit data transfers to the CPU from memory. Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs). 
The following example will assume 2 channels:

		Channel 0	Channel 1
	===================================
	csrow0	| DIMM_A0	| DIMM_B0 |
	csrow1	| DIMM_A0	| DIMM_B0 |
	===================================

	===================================
	csrow2	| DIMM_A1	| DIMM_B1 |
	csrow3	| DIMM_A1	| DIMM_B1 |
	===================================

In the above example table there are 4 physical slots on the motherboard for memory DIMMs:

	DIMM_A0
	DIMM_B0
	DIMM_A1
	DIMM_B1

Labels for these slots are usually silk screened on the motherboard. Slots labeled 'A' are channel 0 in this example. Slots labeled 'B' are channel 1. Notice that there are two csrows possible on a physical DIMM. These csrows are allocated their csrow assignment based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM is placed in each Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow. Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above will have 1 csrow, csrow0. csrow1 will be empty. On the other hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0 and csrow1 will be populated. The pattern repeats itself for csrow2 and csrow3.

The representation of the above is reflected in the directory tree in EDAC's sysfs interface. Starting in directory /sys/devices/system/edac/mc each memory controller will be represented by its own 'mcX' directory, where 'X' is the index of the MC.

	..../edac/mc/
		   |
		   |->mc0
		   |->mc1
		   |->mc2
		   ....

Under each 'mcX' directory each 'csrowX' is again represented by a 'csrowX', where 'X' is the csrow index:

	.../mc/mc0/
		|
		|->csrow0
		|->csrow2
		|->csrow3
		....

Notice that there is no csrow1, which indicates that csrow0 is composed of single ranked DIMMs. This should also apply in both Channels, in order to have dual-channel mode be operational. Since both csrow2 and csrow3 are populated, this indicates a dual ranked set of DIMMs for channels 0 and 1.
Within each of the 'mcX' and 'csrowX' directories are several EDAC control and attribute files.

============================================================================
'mcX' DIRECTORIES

In 'mcX' directories are EDAC control and attribute files for this 'X' instance of the memory controllers:

Counter reset control file:

	'reset_counters'

	This write-only control file will zero all the statistical counters for UE and CE errors. Zeroing the counters will also reset the timer indicating how long since the last counter zero. This is useful for computing errors/time. Since the counters are always reset at driver initialization time, no module/kernel parameter is available.

	RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/reset_counters

		This resets the counters on memory controller 0

Seconds since last counter reset control file:

	'seconds_since_reset'

	This attribute file displays how many seconds have elapsed since the last counter reset. This can be used with the error counters to measure error rates.

Memory Controller name attribute file:

	'mc_name'

	This attribute file displays the type of memory controller that is being utilized.

Total memory managed by this memory controller attribute file:

	'size_mb'

	This attribute file displays, in megabytes, the amount of memory that this instance of memory controller manages.

Total Uncorrectable Errors count attribute file:

	'ue_count'

	This attribute file displays the total count of uncorrectable errors that have occurred on this memory controller. If panic_on_ue is set this counter will not have a chance to increment, since EDAC will panic the system.

Total UE count that had no information attribute file:

	'ue_noinfo_count'

	This attribute file displays the number of UEs that have occurred with no information as to which DIMM slot is having errors.

Total Correctable Errors count attribute file:

	'ce_count'

	This attribute file displays the total count of correctable errors that have occurred on this memory controller.
        This count is very important to examine. CEs provide early
        indications that a DIMM is beginning to fail. This count field should
        be monitored for non-zero values, and any such values should be
        reported to the system administrator.

Total CE count that had no information attribute file:

        'ce_noinfo_count'

        This attribute file displays the number of CEs that
        have occurred with no information as to which DIMM slot
        is having errors. Memory is handicapped, but operational,
        yet no information is available to indicate which slot
        the failing memory is in. This count field should also
        be monitored for non-zero values.

Device Symlink:

        'device'

        Symlink to the memory controller device.

Sdram memory scrubbing rate:

        'sdram_scrub_rate'

        Read/Write attribute file that controls memory scrubbing. The scrubbing
        rate is set by writing a minimum bandwidth in bytes/sec to the attribute
        file. The rate will be translated to an internal value that gives at
        least the specified rate.

        Reading the file will return the actual scrubbing rate employed.

        If configuration fails or memory scrubbing is not implemented, the value
        of the attribute file will be -1.

============================================================================
'csrowX' DIRECTORIES

In the 'csrowX' directories are EDAC control and attribute files for
this 'X' instance of csrow:

Total Uncorrectable Errors count attribute file:

        'ue_count'

        This attribute file displays the total count of uncorrectable
        errors that have occurred on this csrow. If panic_on_ue is set
        this counter will not have a chance to increment, since EDAC
        will panic the system.

Total Correctable Errors count attribute file:

        'ce_count'

        This attribute file displays the total count of correctable
        errors that have occurred on this csrow. This count is very
        important to examine. CEs provide early indications that a DIMM
        is beginning to fail. This count field should be monitored for
        non-zero values, and any such values should be reported to the
        system administrator.
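
        Monitoring the per-csrow 'ce_count' files for non-zero values, as
        recommended above, could be scripted roughly as follows. This is
        a sketch under stated assumptions: the helper name edac_nonzero_ce
        and its optional root-directory argument are hypothetical and only
        exist to make the example testable outside of sysfs.

```shell
#!/bin/sh
# Sketch: report every csrow whose ce_count is greater than zero.
# edac_nonzero_ce is a hypothetical helper name, not an EDAC tool.
# $1 (optional) overrides the sysfs root used in this document.
edac_nonzero_ce() {
	root="${1:-/sys/devices/system/edac/mc}"
	for f in "$root"/mc[0-9]*/csrow[0-9]*/ce_count; do
		[ -r "$f" ] || continue
		n=$(cat "$f")
		# Skip zero counts and anything that is not a number.
		[ "$n" -gt 0 ] 2>/dev/null || continue
		row=${f%/ce_count}
		mc=${row%/*}
		echo "${mc##*/}/${row##*/}: $n correctable errors"
	done
}
```

        The output is empty when every counter is zero, so a periodic job
        could forward any non-empty output to the system administrator.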
Total memory managed by this csrow attribute file:

        'size_mb'

        This attribute file displays, in megabytes, the amount of memory
        that this csrow contains.

Memory Type attribute file:

        'mem_type'

        This attribute file will display what type of memory is currently
        on this csrow. Normally, either buffered or unbuffered memory.
        Examples:
                Registered-DDR
                Unbuffered-DDR

EDAC Mode of operation attribute file:

        'edac_mode'

        This attribute file will display what type of Error detection
        and correction is being utilized.

Device type attribute file:

        'dev_type'

        This attribute file will display what type of DRAM device is
        being utilized on this DIMM.
        Examples:
                x1
                x2
                x4
                x8

Channel 0 CE Count attribute file:

        'ch0_ce_count'

        This attribute file will display the count of CEs on this
        DIMM located in channel 0.

Channel 0 UE Count attribute file:

        'ch0_ue_count'

        This attribute file will display the count of UEs on this
        DIMM located in channel 0.

Channel 0 DIMM Label control file:

        'ch0_dimm_label'

        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.

Channel 1 CE Count attribute file:

        'ch1_ce_count'

        This attribute file will display the count of CEs on this
        DIMM located in channel 1.

Channel 1 UE Count attribute file:

        'ch1_ue_count'

        This attribute file will display the count of UEs on this
        DIMM located in channel 1.

Channel 1 DIMM Label control file:

        'ch1_dimm_label'

        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.

============================================================================
SYSTEM LOGGING

If logging for UEs and CEs is enabled then system logs will have
error notices indicating errors that have been detected:

EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
channel 1 "DIMM_B1": amd76x_edac

EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
channel 1 "DIMM_B1": amd76x_edac

The structure of the message, using the first notice above as the
example, is:

        the memory controller                   (MC0)
        Error type                              (CE)
        memory page                             (0x283)
        offset in the page                      (0xce0)
        the byte granularity                    (grain 8)
                or resolution of the error
        the error syndrome                      (0x6ec3)
        memory row                              (row 0)
        memory channel                          (channel 1)
        DIMM label, if set prior                (DIMM_B1)
        and then an optional, driver-specific message that may
                have additional information.

Both UEs and CEs with no info will lack all but memory controller,
error type, a notice of "no info" and then an optional,
driver-specific error message.

============================================================================
PCI Bus Parity Detection

On Header Type 00 devices the primary status is looked at
for any parity error regardless of whether Parity is enabled on the
device. (The spec indicates parity is generated in some cases).
On Header Type 01 bridges, the secondary status register is also
looked at to see if a parity error occurred on the bus on the other
side of the bridge.

SYSFS CONFIGURATION

Under /sys/devices/system/edac/pci are control and attribute files as follows:

Enable/Disable PCI Parity checking control file:

        'check_pci_parity'

        This control file enables or disables the PCI Bus Parity scanning
        operation.
        Writing a 1 to this file enables the scanning. Writing
        a 0 to this file disables the scanning.

        Enable:
        echo "1" >/sys/devices/system/edac/pci/check_pci_parity

        Disable:
        echo "0" >/sys/devices/system/edac/pci/check_pci_parity

Parity Count:

        'pci_parity_count'

        This attribute file will display the number of parity errors that
        have been detected.

============================================================================
MODULE PARAMETERS

Panic on UE control file:

        'edac_mc_panic_on_ue'

        An uncorrectable error will cause a machine panic. This is usually
        desirable. It is a bad idea to continue when an uncorrectable error
        occurs - it is indeterminate what was uncorrected and the operating
        system context might be so mangled that continuing will lead to further
        corruption. If the kernel has MCE configured, then EDAC will never
        notice the UE.

        LOAD TIME: module/kernel parameter: edac_mc_panic_on_ue=[0|1]

        RUN TIME:  echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue

Log UE control file:

        'edac_mc_log_ue'

        Generate kernel messages describing uncorrectable errors. These errors
        are reported through the system message log system. UE statistics
        will be accumulated even when UE logging is disabled.

        LOAD TIME: module/kernel parameter: edac_mc_log_ue=[0|1]

        RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue

Log CE control file:

        'edac_mc_log_ce'

        Generate kernel messages describing correctable errors. These
        errors are reported through the system message log system.
        CE statistics will be accumulated even when CE logging is disabled.

        LOAD TIME: module/kernel parameter: edac_mc_log_ce=[0|1]

        RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce

Polling period control file:

        'edac_mc_poll_msec'

        The time period, in milliseconds, for polling for error information.
        Too small a value wastes resources. Too large a value might delay
        necessary handling of errors and might lose valuable information for
        locating the error.
        1000 milliseconds (once each second) is the current
        default. Systems which require all the bandwidth they can get, may
        increase this.

        LOAD TIME: module/kernel parameter: edac_mc_poll_msec=N
                   (N is the polling period in milliseconds)

        RUN TIME: echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec

Panic on PCI PARITY Error:

        'panic_on_pci_parity'

        This control file enables or disables panicking when a parity
        error has been detected.

        module/kernel parameter: edac_panic_on_pci_pe=[0|1]

        Enable:
        echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe

        Disable:
        echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe

=======================================================================

EDAC_DEVICE type of device

In the header file, edac_core.h, there is a series of edac_device structures
and APIs for the EDAC_DEVICE.

User space access to an edac_device is through the sysfs interface.

At the location /sys/devices/system/edac (sysfs) new edac_device devices will
appear.

There is a three level tree beneath the above 'edac' directory. For example,
the 'test_device_edac' device (found at the bluesmoke.sourceforge.net website)
installs itself as:

        /sys/devices/system/edac/test-instance

in this directory are various controls, a symlink and one or more 'instance'
directories.

The standard default controls are:

        log_ce          boolean to log CE events
        log_ue          boolean to log UE events
        panic_on_ue     boolean to 'panic' the system if an UE is encountered
                        (default off, can be set true via startup script)
        poll_msec       time period between POLL cycles for events

The test_device_edac device adds at least one of its own custom control:

        test_bits       which in the current test driver does nothing but
                        show how it is installed. A ported driver can
                        add one or more such controls and/or attributes
                        for specific uses.
                        One out-of-tree driver uses controls here to allow
                        for ERROR INJECTION operations to hardware
                        injection registers.

The symlink points to the 'struct dev' that is registered for this edac_device.
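
As an illustration of the standard default controls listed above, the
sketch below prints their current values for one edac_device directory.
The helper name edac_device_controls is hypothetical, and whether a given
driver actually exposes all four files is an assumption; files that are
absent or unreadable are reported as missing rather than treated as errors.

```shell
#!/bin/sh
# Sketch: show the standard default controls of one edac_device.
# edac_device_controls is a hypothetical helper name.
# $1 is the device directory, for example the test-instance
# directory mentioned in the text above.
edac_device_controls() {
	dev="$1"
	for ctl in log_ce log_ue panic_on_ue poll_msec; do
		if [ -r "$dev/$ctl" ]; then
			echo "$ctl=$(cat "$dev/$ctl")"
		else
			echo "$ctl=<missing>"
		fi
	done
}
```

For the test driver this might be invoked as
"edac_device_controls /sys/devices/system/edac/test-instance".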
INSTANCES

One or more instance directories are present. For the 'test_device_edac' case:

        test-instance0

In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.

        ce_count        total of CE events of subdirectories
        ue_count        total of UE events of subdirectories

BLOCKS

At the lowest directory level is the 'block' directory. There can be 0, 1
or more blocks specified in each instance.

        test-block0

In this directory the default attributes are:

        ce_count        which is counter of CE events for this 'block'
                        of hardware being monitored
        ue_count        which is counter of UE events for this 'block'
                        of hardware being monitored

The 'test_device_edac' device adds 4 attributes and 1 control:

        test-block-bits-0       for every POLL cycle this counter
                                is incremented
        test-block-bits-1       every 10 cycles, this counter is bumped once,
                                and test-block-bits-0 is set to 0
        test-block-bits-2       every 100 cycles, this counter is bumped once,
                                and test-block-bits-1 is set to 0
        test-block-bits-3       every 1000 cycles, this counter is bumped once,
                                and test-block-bits-2 is set to 0

        reset-counters          writing ANY thing to this control will
                                reset all the above counters.

Use of the 'test_device_edac' driver should enable others to create their own
unique drivers for their hardware systems.

The 'test_device_edac' sample driver is located at the
bluesmoke.sourceforge.net project site for EDAC.

=======================================================================
NEHALEM USAGE OF EDAC APIs

This chapter documents some EXPERIMENTAL mappings for EDAC API to handle
Nehalem EDAC driver. They will likely be changed in future versions
of the driver.

Due to the way Nehalem exports Memory Controller data, some adjustments
were made in the i7core_edac driver. This chapter will cover those
differences.

1) On Nehalem, there is one Memory Controller per QuickPath Interconnect
   (QPI). In the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.
   Each MC has 3 physical read channels, 3 physical write channels and
   3 logic channels. The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API uses csrows as its smallest unit, the driver
   sequentially maps channel/dimm into different csrows.

   For example, supposing the following layout:
        Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
          dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
        Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
        Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
   The driver will map it as:
        csrow0: channel 0, dimm0
        csrow1: channel 0, dimm1
        csrow2: channel 1, dimm0
        csrow3: channel 2, dimm0

   That is, the driver exports one DIMM per csrow.

   Each QPI is exported as a different memory controller.

2) The Nehalem MC has the ability to generate errors. The driver implements
   this functionality via some error injection nodes:

   For injecting a memory error, there are some sysfs nodes, under
   /sys/devices/system/edac/mc/mc?/:

   inject_addrmatch/*:
      Controls the error injection mask register. It is possible to specify
      several characteristics of the address to match an error code:
         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.
      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to any.
      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column:
                echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
                echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

      To return to the default behaviour of matching any, you can do:
                echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
                echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

   inject_eccmask:
       specifies which bits will be affected by the error.

   inject_section:
       specifies which ECC cache section will get the error:
                3 for both
                2 for the highest
                1 for the lowest

   inject_type:
       specifies the type of error, being a combination of the following bits:
                bit 0 - repeat
                bit 1 - ecc
                bit 2 - parity

   inject_enable starts the error generation when something different
   than 0 is written.

   All inject vars can be read. Root permission is needed for write.

   The datasheet states that the error will only be generated after a write
   on an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2:

        echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
        echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
        echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
        echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
        echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
        dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   For socket 1, replace "mc0" with "mc1" in the above commands.

   The generated error message will look like:

   EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

3) Nehalem specific Corrected Error memory counters

   Nehalem has some registers to count memory errors.
   The driver uses those registers to report Corrected Errors on devices
   with Registered Dimms.

   However, those counters don't work with Unregistered Dimms. As the chipset
   offers some counters that also work with UDIMMS (but with a worse level of
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.

   They can be read by looking at the contents of all_channel_counts/

   $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
        /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
        0
        /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
        0
        /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
        0

   What happens here is that errors on different csrows, but at the same
   dimm number will increment the same counter.
   So, in this memory mapping:
        csrow0: channel 0, dimm0
        csrow1: channel 0, dimm1
        csrow2: channel 1, dimm0
        csrow3: channel 2, dimm0
   The hardware will increment udimm0 for an error at the first dimm at either
        csrow0, csrow2 or csrow3;
   The hardware will increment udimm1 for an error at the second dimm at either
        csrow0, csrow2 or csrow3;
   The hardware will increment udimm2 for an error at the third dimm at either
        csrow0, csrow2 or csrow3;

4) Standard error counters

   The standard error counters are generated when an mcelog error is received
   by the driver. Since, with udimm, this is counted by software, it is
   possible that some errors could be lost. With rdimm's, they display the
   contents of the registers.

EISA bus support (Marc Zyngier)

This document groups random notes about porting EISA drivers to the
new EISA/sysfs API.

Starting from version 2.5.59, the EISA bus is almost given the same
status as other much more mainstream busses such as PCI or USB. This
has been possible through sysfs, which defines a nice enough set of
abstractions to manage busses, devices and drivers.
Although the new API is quite simple to use, converting existing
drivers to the new infrastructure is not an easy task (mostly because
detection code is generally also used to probe ISA cards). Moreover,
most EISA drivers are among the oldest Linux drivers so, as you can
imagine, some dust has settled here over the years.

The EISA infrastructure is made up of three parts :

    - The bus code implements most of the generic code. It is shared
    among all the architectures that the EISA code runs on. It
    implements bus probing (detecting EISA cards available on the bus),
    allocates I/O resources, allows fancy naming through sysfs, and
    offers interfaces for drivers to register.

    - The bus root driver implements the glue between the bus hardware
    and the generic bus code. It is responsible for discovering the
    device implementing the bus, and setting it up to be later probed
    by the bus code. This can go from something as simple as reserving
    an I/O region on x86, to the rather more complex, like the hppa
    EISA code. This is the part to implement in order to have EISA
    running on a "new" platform.

    - The driver offers the bus a list of devices that it manages, and
    implements the necessary callbacks to probe and release devices
    whenever told to.

Every function/structure below lives in <linux/eisa.h>, which depends
heavily on <linux/device.h>.

** Bus root driver :

int eisa_root_register (struct eisa_root_device *root);

The eisa_root_register function is used to declare a device as the
root of an EISA bus. The eisa_root_device structure holds a reference
to this device, as well as some parameters for probing purposes.
struct eisa_root_device {
        struct device   *dev;    /* Pointer to bridge device */
        struct resource *res;
        unsigned long    bus_base_addr;
        int              slots;  /* Max slot number */
        int              force_probe; /* Probe even when no slot 0 */
        u64              dma_mask; /* from bridge device */
        int              bus_nr; /* Set by eisa_root_register */
        struct resource  eisa_root_res; /* ditto */
};

node          : used for eisa_root_register internal purpose
dev           : pointer to the root device
res           : root device I/O resource
bus_base_addr : slot 0 address on this bus
slots         : max slot number to probe
force_probe   : Probe even when slot 0 is empty (no EISA mainboard)
dma_mask      : Default DMA mask. Usually the bridge device dma_mask.
bus_nr        : unique bus id, set by eisa_root_register

** Driver :

int eisa_driver_register (struct eisa_driver *edrv);
void eisa_driver_unregister (struct eisa_driver *edrv);

Clear enough ?

struct eisa_device_id {
        char sig[EISA_SIG_LEN];
        unsigned long driver_data;
};

struct eisa_driver {
        const struct eisa_device_id *id_table;
        struct device_driver         driver;
};

id_table        : an array of NULL terminated EISA id strings,
                  followed by an empty string. Each string can
                  optionally be paired with a driver-dependent value
                  (driver_data).

driver          : a generic driver, such as described in
                  Documentation/driver-model/driver.txt. Only .name,
                  .probe and .remove members are mandatory.

An example is the 3c59x driver :

static struct eisa_device_id vortex_eisa_ids[] = {
        { "TCM5920", EISA_3C592_OFFSET },
        { "TCM5970", EISA_3C597_OFFSET },
        { "" }
};

static struct eisa_driver vortex_eisa_driver = {
        .id_table = vortex_eisa_ids,
        .driver   = {
                .name    = "3c59x",
                .probe   = vortex_eisa_probe,
                .remove  = vortex_eisa_remove
        }
};

** Device :

The sysfs framework calls .probe and .remove functions upon device
discovery and removal (note that the .remove function is only called
when driver is built as a module).
Both functions are passed a pointer to a 'struct device', which is
encapsulated in a 'struct eisa_device' described as follows :

struct eisa_device {
        struct eisa_device_id id;
        int                   slot;
        int                   state;
        unsigned long         base_addr;
        struct resource       res[EISA_MAX_RESOURCES];
        u64                   dma_mask;
        struct device         dev; /* generic device */
};

id      : EISA id, as read from device. id.driver_data is set from the
          matching driver EISA id.
slot    : slot number which the device was detected on
state   : set of flags indicating the state of the device. Current
          flags are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED.
res     : set of four 256 bytes I/O regions allocated to this device
dma_mask: DMA mask set from the parent device.
dev     : generic device (see Documentation/driver-model/device.txt)

You can get the 'struct eisa_device' from 'struct device' using the
'to_eisa_device' macro.

** Misc stuff :

void eisa_set_drvdata (struct eisa_device *edev, void *data);

Stores data into the device's driver_data area.

void *eisa_get_drvdata (struct eisa_device *edev);

Gets the pointer previously stored into the device's driver_data area.

int eisa_get_region_index (void *addr);

Returns the region number (0 <= x < EISA_MAX_RESOURCES) of a given
address.

** Kernel parameters :

eisa_bus.enable_dev :

A comma-separated list of slots to be enabled, even if the firmware
set the card as disabled. The driver must be able to properly
initialize the device in such conditions.

eisa_bus.disable_dev :

A comma-separated list of slots to be disabled, even if the firmware
set the card as enabled. The driver won't be called to handle this
device.

virtual_root.force_probe :

Force the probing code to probe EISA slots even when it cannot find an
EISA compliant mainboard (nothing appears on slot 0). Defaults to 0
(don't force), and set to 1 (force probing) when either
CONFIG_ALPHA_JENSEN or CONFIG_EISA_VLB_PRIMING are set.
** Random notes : Converting an EISA driver to the new API mostly involves *deleting* code (since probing is now in the core EISA code). Unfortunately, most drivers share their probing routine between ISA, MCA and EISA. Special care must be taken when ripping out the EISA code, so other busses won't suffer from these surgical strikes... You *must not* expect any EISA device to be detected when returning from eisa_driver_register, since the chances are that the bus has not yet been probed. In fact, that's what happens most of the time (the bus root driver usually kicks in rather late in the boot process). Unfortunately, most drivers are doing the probing by themselves, and expect to have explored the whole machine when they exit their probe routine. For example, switching your favorite EISA SCSI card to the "hotplug" model is "the right thing"(tm). ** Thanks : I'd like to thank the following people for their help : - Xavier Benigni for lending me a wonderful Alpha Jensen, - James Bottomley, Jeff Garzik for getting this stuff into the kernel, - Andries Brouwer for contributing numerous EISA ids, - Catrin Jones for coping with far too many machines at home. Email clients info for Linux ====================================================================== General Preferences ---------------------------------------------------------------------- Patches for the Linux kernel are submitted via email, preferably as inline text in the body of the email. Some maintainers accept attachments, but then the attachments should have content-type "text/plain". However, attachments are generally frowned upon because it makes quoting portions of the patch more difficult in the patch review process. Email clients that are used for Linux kernel patches should send the patch text untouched. For example, they should not modify or delete tabs or spaces, even at the beginning or end of lines. Don't send patches with "format=flowed". This can cause unexpected and unwanted line breaks. 
Don't let your email client do automatic word wrapping for you. This can also corrupt your patch. Email clients should not modify the character set encoding of the text. Emailed patches should be in ASCII or UTF-8 encoding only. If you configure your email client to send emails with UTF-8 encoding, you avoid some possible charset problems. Email clients should generate and maintain References: or In-Reply-To: headers so that mail threading is not broken. Copy-and-paste (or cut-and-paste) usually does not work for patches because tabs are converted to spaces. Using xclipboard, xclip, and/or xcutsel may work, but it's best to test this for yourself or just avoid copy-and-paste. Don't use PGP/GPG signatures in mail that contains patches. This breaks many scripts that read and apply the patches. (This should be fixable.) It's a good idea to send a patch to yourself, save the received message, and successfully apply it with 'patch' before sending patches to Linux mailing lists. Some email client (MUA) hints ---------------------------------------------------------------------- Here are some specific MUA configuration hints for editing and sending patches for the Linux kernel. These are not meant to be complete software package configuration summaries. Legend: TUI = text-based user interface GUI = graphical user interface ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Alpine (TUI) Config options: In the "Sending Preferences" section: - "Do Not Send Flowed Text" must be enabled - "Strip Whitespace Before Sending" must be disabled When composing the message, the cursor should be placed where the patch should appear, and then pressing CTRL-R let you specify the patch file to insert into the message. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Evolution (GUI) Some people use this successfully for patches. When composing mail select: Preformat from Format->Heading->Preformatted (Ctrl-7) or the toolbar Then use: Insert->Text File... (Alt-n x) to insert the patch. 
You can also "diff -Nru old.c new.c | xclip", select Preformat, then
paste with the middle button.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kmail (GUI)

Some people use Kmail successfully for patches.

The default setting of not composing in HTML is appropriate; do not
enable it.

When composing an email, under options, uncheck "word wrap". The only
disadvantage is any text you type in the email will not be word-wrapped
so you will have to manually word wrap text before the patch. The easiest
way around this is to compose your email with word wrap enabled, then save
it as a draft. Once you pull it up again from your drafts it is now hard
word-wrapped and you can uncheck "word wrap" without losing the existing
wrapping.

At the bottom of your email, put the commonly-used patch delimiter before
inserting your patch: three hyphens (---).

Then from the "Message" menu item, select insert file and choose your patch.
As an added bonus you can customise the message creation toolbar menu
and put the "insert file" icon there.

Make the composer window wide enough so that no lines wrap. As of
KMail 1.13.5 (KDE 4.5.4), KMail will apply word wrapping when sending
the email if the lines wrap in the composer window. Having word wrapping
disabled in the Options menu isn't enough. Thus, if your patch has very
long lines, you must make the composer window very wide before sending
the email. See: https://bugs.kde.org/show_bug.cgi?id=174034

You can safely GPG sign attachments, but inlined text is preferred for
patches so do not GPG sign them. Signing patches that have been inserted
as inlined text will make them tricky to extract from their 7-bit encoding.

If you absolutely must send patches as attachments instead of inlining
them as text, right click on the attachment and select properties, and
highlight "Suggest automatic display" to make the attachment inlined to
make it more viewable.
When saving patches that are sent as inlined text, select the email that contains the patch from the message list pane, right click and select "save as". You can use the whole email unmodified as a patch if it was properly composed. There is no option currently to save the email when you are actually viewing it in its own window -- there has been a request filed at kmail's bugzilla and hopefully this will be addressed. Emails are saved as read-write for user only so you will have to chmod them to make them group and world readable if you copy them elsewhere. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Lotus Notes (GUI) Run away from it. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Mutt (TUI) Plenty of Linux developers use mutt, so it must work pretty well. Mutt doesn't come with an editor, so whatever editor you use should be used in a way that there are no automatic linebreaks. Most editors have an "insert file" option that inserts the contents of a file unaltered. To use 'vim' with mutt: set editor="vi" If using xclip, type the command :set paste before middle button or shift-insert or use :r filename if you want to include the patch inline. (a)ttach works fine without "set paste". Config options: It should work with default settings. However, it's a good idea to set the "send_charset" to: set send_charset="us-ascii:utf-8" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pine (TUI) Pine has had some whitespace truncation issues in the past, but these should all be fixed now. Use alpine (pine's successor) if you can. Config options: - quell-flowed-text is needed for recent versions - the "no-strip-whitespace-before-send" option is needed ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sylpheed (GUI) - Works well for inlining text (or using attachments). - Allows use of an external editor. - Is slow on large folders. - Won't do TLS SMTP auth over a non-SSL connection. - Has a helpful ruler bar in the compose window. 
- Adding addresses to address book doesn't understand the display name
  properly.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Thunderbird (GUI)

Thunderbird is an Outlook clone that likes to mangle text, but there are ways
to coerce it into behaving.

- Allows use of an external editor:
  The easiest thing to do with Thunderbird and patches is to use an
  "external editor" extension and then just use your favorite $EDITOR
  for reading/merging patches into the body text. To do this, download
  and install the extension, then add a button for it using
  View->Toolbars->Customize... and finally just click on it when in the
  Compose dialog.

To beat some sense out of the internal editor, do this:

- Edit your Thunderbird config settings so that it won't use format=flowed.
  Go to "edit->preferences->advanced->config editor" to bring up
  Thunderbird's registry editor, and set "mailnews.send_plaintext_flowed" to
  "false".

- Disable HTML Format: Set "mail.identity.id1.compose_html" to "false".

- Enable "preformat" mode: Set "editor.quotesPreformatted" to "true".

- Enable UTF8: Set "prefs.converted-to-utf8" to "true".

- Install the "toggle wordwrap" extension. Download the file from:
    https://addons.mozilla.org/thunderbird/addon/2351/
  Then go to "tools->add ons", select "install" at the bottom of the screen,
  and browse to where you saved the .xul file. This adds an "Enable
  Wordwrap" entry under the Options menu of the message composer.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TkRat (GUI)

Works. Use "Insert file..." or external editor.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gmail (Web GUI)

Does not work for sending patches.

The Gmail web client converts tabs to spaces automatically.

It also wraps lines every 78 chars with CRLF style line breaks. The
tab-to-space problem can be solved with an external editor, but the line
wrapping remains.

Another problem is that Gmail will base64-encode any message that has a
non-ASCII character. That includes things like European names.
The following is a list of files and features that are going to be removed in the kernel source tree. Every entry should contain what exactly is going away, why it is happening, and who is going to be doing the work. When the feature is removed from the kernel, it should also be removed from this file.

---------------------------

What: x86 floppy disable_hlt
When: 2012
Why: ancient workaround of dubious utility clutters the code used by everybody else.
Who: Len Brown

---------------------------

What: CONFIG_APM_CPU_IDLE, and its ability to call APM BIOS in idle
When: 2012
Why: This optional sub-feature of APM is of dubious reliability, and ancient APM laptops are likely better served by calling HLT. Deleting CONFIG_APM_CPU_IDLE allows x86 to stop exporting the pm_idle function pointer to modules.
Who: Len Brown

---------------------------

What: x86_32 "no-hlt" cmdline param
When: 2012
Why: remove a branch from idle path, simplify code used by everybody. This option disabled the use of HLT in idle and machine_halt() for hardware that was flaky 15 years ago. Today we have "idle=poll", which removes HLT from idle, and so if such a machine is still running the upstream kernel, "idle=poll" is likely sufficient.
Who: Len Brown

---------------------------

What: x86 "idle=mwait" cmdline param
When: 2012
Why: simplify x86 idle code
Who: Len Brown

---------------------------

What: PRISM54
When: 2.6.34
Why: prism54 FullMAC PCI / Cardbus devices used to be supported only by the prism54 wireless driver. After Intersil stopped selling these devices in preference for the newer, more flexible SoftMAC devices, a SoftMAC device driver was required and prism54 did not support them. The p54pci driver now exists and has been present in the kernel for a while. This driver supports both SoftMAC devices and FullMAC devices. The main difference between these devices was the amount of memory which could be used for the firmware. The SoftMAC devices support a smaller amount of memory.
Because of this the SoftMAC firmware fits into the FullMAC devices' memory. p54pci supports not only PCI / Cardbus but also USB and SPI. Since p54pci supports all devices prism54 supports you will have a conflict. I'm not quite sure how distributions are handling this conflict right now. prism54 was kept around due to claims users may experience issues when using the SoftMAC driver. Time has passed and users have not reported issues. If you use prism54 and for whatever reason you cannot use p54pci please let us know! E-mail us at: linux-wireless@vger.kernel.org

For more information see the p54 wiki page:

    http://wireless.kernel.org/en/users/Drivers/p54

Who: Luis R. Rodriguez

---------------------------

What: IRQF_SAMPLE_RANDOM
Check: IRQF_SAMPLE_RANDOM
When: July 2009
Why: Many of the IRQF_SAMPLE_RANDOM users are technically bogus as entropy sources in the kernel's current entropy model. To resolve this, every input point to the kernel's entropy pool needs to better document the type of entropy source it actually is. This will be replaced with additional add_*_randomness functions in drivers/char/random.c
Who: Robin Getz & Matt Mackall

---------------------------

What: The ieee80211_regdom module parameter
When: March 2010 / desktop catchup
Why: This was inherited by the CONFIG_WIRELESS_OLD_REGULATORY code, and currently serves as an option for users to define an ISO / IEC 3166 alpha2 code for the country they are currently present in. Although there are userspace API replacements for this through nl80211, distributions haven't yet caught up with implementing decent alternatives through standard GUIs. Although it is available as an option through iw or wpa_supplicant, it is just a matter of time before distributions pick up good GUI options for this. The ideal solution would actually consist of intelligent designs which would do this for the user automatically even when travelling through different countries. Until then we leave this module parameter as a compromise.
When userspace improves with reasonable widely-available alternatives for this we will no longer need this module parameter. This entry hopes that by the super-futuristically looking date of "March 2010" we will have such replacements widely available.
Who: Luis R. Rodriguez

---------------------------

What: dev->power.power_state
When: July 2007
Why: Broken design for runtime control over driver power states, confusing driver-internal runtime power management with: mechanisms to support system-wide sleep state transitions; event codes that distinguish different phases of swsusp "sleep" transitions; and userspace policy inputs. This framework was never widely used, and most attempts to use it were broken. Drivers should instead be exposing domain-specific interfaces either to kernel or to userspace.
Who: Pavel Machek

---------------------------

What: /proc/<pid>/oom_adj
When: August 2012
Why: /proc/<pid>/oom_adj allows userspace to influence the oom killer's badness heuristic used to determine which task to kill when the kernel is out of memory.

The badness heuristic has been rewritten since the introduction of this tunable, such that its meaning is deprecated. The value was implemented as a bitshift on a score generated by the badness() function that did not have any precise units of measure. With the rewrite, the score is given as a proportion of available memory to the task allocating pages, so a bitshift, which grows the score exponentially, is impossible to tune with fine granularity.

A much more powerful interface, /proc/<pid>/oom_score_adj, was introduced with the oom killer rewrite that allows users to increase or decrease the badness score linearly. This interface will replace /proc/<pid>/oom_adj.

A warning will be emitted to the kernel log if an application uses this deprecated interface. After it is printed once, future warnings will be suppressed until the kernel is rebooted.
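As an illustration of the move from the exponential oom_adj scale to the linear oom_score_adj scale, the following sketch converts a legacy value to its linear counterpart. The ranges used here (-17..15 for oom_adj, -1000..1000 for oom_score_adj) and the scaling rule are assumptions not stated in this entry, and oom_adj_to_score_adj() is a hypothetical helper name, not a kernel interface.

```c
#include <assert.h>

/* Assumed limits of the two tunables; not taken from this document. */
#define OOM_DISABLE        (-17)   /* lowest oom_adj value */
#define OOM_ADJUST_MAX     15      /* highest oom_adj value */
#define OOM_SCORE_ADJ_MAX  1000    /* highest oom_score_adj value */

/*
 * Hypothetical helper: scale a legacy oom_adj value linearly onto the
 * oom_score_adj range. The top of the old range maps to the top of the
 * new one; everything else is scaled by 1000/17 and truncated.
 */
static int oom_adj_to_score_adj(int oom_adj)
{
	assert(oom_adj >= OOM_DISABLE && oom_adj <= OOM_ADJUST_MAX);

	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}
```

An application migrating away from the deprecated file could compute the new value this way once and then write it to /proc/<pid>/oom_score_adj instead.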
---------------------------

What: remove EXPORT_SYMBOL(kernel_thread)
When: August 2006
Files: arch/*/kernel/*_ksyms.c
Check: kernel_thread
Why: kernel_thread is a low-level implementation detail. Drivers should use the <linux/kthread.h> API instead, which shields them from implementation details and provides a higher-level interface that prevents bugs and code duplication.
Who: Christoph Hellwig

---------------------------

What: Unused EXPORT_SYMBOL/EXPORT_SYMBOL_GPL exports (temporary transition config option provided until then). The transition config option will also be removed at the same time.
When: before 2.6.19
Why: Unused symbols are both increasing the size of the kernel binary and are often a sign of "wrong API"
Who: Arjan van de Ven

---------------------------

What: PHYSDEVPATH, PHYSDEVBUS, PHYSDEVDRIVER in the uevent environment
When: October 2008
Why: The stacking of class devices makes these values misleading and inconsistent. Class devices should not carry any of these properties, and bus devices have SUBSYSTEM and DRIVER as a replacement.
Who: Kay Sievers

---------------------------

What: ACPI procfs interface
When: July 2008
Why: ACPI sysfs conversion should be finished by January 2008. ACPI procfs interface will be removed in July 2008 so that there is enough time for the user space to catch up.
Who: Zhang Rui

---------------------------

What: CONFIG_ACPI_PROCFS_POWER
When: 2.6.39
Why: sysfs I/F for ACPI power devices, including AC and Battery, has been working in upstream kernel since 2.6.24, Sep 2007. In 2.6.37, we make the sysfs I/F always built in and this option disabled by default. Remove this option and the ACPI power procfs interface in 2.6.39.
Who: Zhang Rui

---------------------------

What: /proc/acpi/event
When: February 2008
Why: /proc/acpi/event has been replaced by events via the input layer and netlink since 2.6.23.
Who: Len Brown

---------------------------

What: i386/x86_64 bzImage symlinks
When: April 2010
Why: The i386/x86_64 merge provides a symlink to the old bzImage location so that user space tools which have not yet been updated, e.g. package scripts, do not break.
Who: Thomas Gleixner

---------------------------

What: GPIO autorequest on gpio_direction_{input,output}() in gpiolib
When: February 2010
Why: All callers should use explicit gpio_request()/gpio_free(). The autorequest mechanism in gpiolib was provided mostly as a migration aid for legacy GPIO interfaces (for SOC based GPIOs). Those users have now largely migrated. Platforms implementing the GPIO interfaces without using gpiolib will see no changes.
Who: David Brownell

---------------------------

What: b43 support for firmware revision < 410
When: The schedule was July 2008, but it was decided that we are going to keep the code as long as there are no major maintenance headaches. So it _could_ be removed _any_ time now, if it conflicts with something new.
Why: The support code for the old firmware hurts code readability/maintainability and slightly hurts runtime performance. Bugfixes for the old firmware are not provided by Broadcom anymore.
Who: Michael Buesch

---------------------------

What: Ability for non root users to shm_get hugetlb pages based on mlock resource limits
When: 2.6.31
Why: Non root users need to be part of /proc/sys/vm/hugetlb_shm_group or have CAP_IPC_LOCK to be able to allocate shm segments backed by huge pages. The mlock based rlimit check to allow shm hugetlb is inconsistent with mmap based allocations. Hence it is being deprecated.
Who: Ravikiran Thirumalai --------------------------- What: Code that is now under CONFIG_WIRELESS_EXT_SYSFS (in net/core/net-sysfs.c) When: 3.5 Why: Over 1K .text/.data size reduction, data is available in other ways (ioctls) Who: Johannes Berg --------------------------- What: sysfs ui for changing p4-clockmod parameters When: September 2009 Why: See commits 129f8ae9b1b5be94517da76009ea956e89104ce8 and e088e4c9cdb618675874becb91b2fd581ee707e6. Removal is subject to fixing any remaining bugs in ACPI which may cause the thermal throttling not to happen at the right time. Who: Dave Jones , Matthew Garrett ----------------------------- What: fakephp and associated sysfs files in /sys/bus/pci/slots/ When: 2011 Why: In 2.6.27, the semantics of /sys/bus/pci/slots was redefined to represent a machine's physical PCI slots. The change in semantics had userspace implications, as the hotplug core no longer allowed drivers to create multiple sysfs files per physical slot (required for multi-function devices, e.g.). fakephp was seen as a developer's tool only, and its interface changed. Too late, we learned that there were some users of the fakephp interface. In 2.6.30, the original fakephp interface was restored. At the same time, the PCI core gained the ability that fakephp provided, namely function-level hot-remove and hot-add. Since the PCI core now provides the same functionality, exposed in: /sys/bus/pci/rescan /sys/bus/pci/devices/.../remove /sys/bus/pci/devices/.../rescan there is no functional reason to maintain fakephp as well. We will keep the existing module so that 'modprobe fakephp' will present the old /sys/bus/pci/slots/... interface for compatibility, but users are urged to migrate their applications to the API above. After a reasonable transition period, we will remove the legacy fakephp interface. Who: Alex Chiang --------------------------- What: CONFIG_RFKILL_INPUT When: 2.6.33 Why: Should be implemented in userspace, policy daemon. 
Who: Johannes Berg

---------------------------

What: sound-slot/service-* module aliases and related clutter in sound/sound_core.c
When: August 2010
Why: OSS sound_core grabs all legacy minors (0-255) of SOUND_MAJOR (14) and requests modules using custom sound-slot/service-* module aliases. The only benefit of doing this is allowing use of custom module aliases which might as well be considered a bug at this point. This preemptive claiming prevents alternative OSS implementations.

Till the feature is removed, the kernel will be requesting both sound-slot/service-* and the standard char-major-* module aliases and allow turning off the pre-claiming selectively via CONFIG_SOUND_OSS_CORE_PRECLAIM and soundcore.preclaim_oss kernel parameter.

After the transition phase is complete, both the custom module aliases and switches to disable it will go away. This removal will also allow making ALSA OSS emulation independent of sound_core. The dependency will be broken then too.
Who: Tejun Heo

---------------------------

What: sysfs-class-rfkill state file
When: Feb 2014
Files: net/rfkill/core.c
Why: Documented as obsolete since Feb 2010. This file is limited to 3 states while the rfkill drivers can have 4 states.
Who: anybody or Florian Mickler

---------------------------

What: sysfs-class-rfkill claim file
When: Feb 2012
Files: net/rfkill/core.c
Why: It is not possible to claim an rfkill driver since 2007. This is documented as obsolete since Feb 2010.
Who: anybody or Florian Mickler

---------------------------

What: iwlwifi 50XX module parameters
When: 3.0
Why: The "..50" module parameters were used to configure 5000 series and up devices; a different set of module parameters with the same functionality is also available for 4965.
Consolidate both sets into a single place in drivers/net/wireless/iwlwifi/iwl-agn.c
Who: Wey-Yi Guy

---------------------------

What: iwl4965 alias support
When: 3.0
Why: Internal alias support has been present in module-init-tools for some time, so the MODULE_ALIAS("iwl4965") boilerplate aliases can be removed with no impact.
Who: Wey-Yi Guy

---------------------------

What: xt_NOTRACK
Files: net/netfilter/xt_NOTRACK.c
When: April 2011
Why: Superseded by xt_CT
Who: Netfilter developer team

---------------------------

What: IRQF_DISABLED
When: 2.6.36
Why: The flag is a NOOP as we run interrupt handlers with interrupts disabled
Who: Thomas Gleixner

---------------------------

What: PCI DMA unmap state API
When: August 2012
Why: PCI DMA unmap state API (include/linux/pci-dma.h) was replaced with DMA unmap state API (DMA unmap state API can be used for any bus).
Who: FUJITA Tomonori

---------------------------

What: iwlwifi disable_hw_scan module parameters
When: 3.0
Why: Hardware scan is the preferred method for iwlwifi devices for scanning operation. Remove software scan support for all the iwlwifi devices.
Who: Wey-Yi Guy

---------------------------

What: Legacy, non-standard chassis intrusion detection interface.
When: June 2011
Why: The adm9240, w83792d and w83793 hardware monitoring drivers have legacy interfaces for chassis intrusion detection. A standard interface has been added to each driver, so the legacy interface can be removed.
Who: Jean Delvare

---------------------------

What: xt_connlimit rev 0
When: 2012
Who: Jan Engelhardt
Files: net/netfilter/xt_connlimit.c

---------------------------

What: ipt_addrtype match include file
When: 2012
Why: superseded by xt_addrtype
Who: Florian Westphal
Files: include/linux/netfilter_ipv4/ipt_addrtype.h

---------------------------

What: i2c_driver.attach_adapter
      i2c_driver.detach_adapter
When: September 2011
Why: These legacy callbacks should no longer be used as i2c-core offers a variety of preferable alternative ways to instantiate I2C devices.
Who: Jean Delvare

---------------------------

What: Opening a radio device node will no longer automatically switch the tuner mode from tv to radio.
When: 3.3
Why: Just opening a V4L device should not change the state of the hardware like that. It's very unexpected and against the V4L spec. Instead, you switch to radio mode by calling VIDIOC_S_FREQUENCY. This is the second and last step of the move to consistent handling of tv and radio tuners.
Who: Hans Verkuil

---------------------------

What: g_file_storage driver
When: 3.8
Why: This driver has been superseded by g_mass_storage.
Who: Alan Stern

---------------------------

What: threeg and interface sysfs files in /sys/devices/platform/acer-wmi
When: 2012
Why: Since 3.0, the internal 3G device can be autodetected and the threeg rfkill device already exists, so we plan to remove the threeg sysfs file because it is no longer necessary.

We also plan to remove the interface sysfs file, which exposed the ACPI-WMI interface used by the acer-wmi driver. It will be replaced by an informational log message when acer-wmi initializes.
Who: Lee, Chun-Yi

---------------------------

What: /sys/devices/platform/_UDC_/udc/_UDC_/is_dualspeed file and is_dualspeed line in /sys/devices/platform/ci13xxx_*/udc/device file.
When: 3.8
Why: The is_dualspeed file is superseded by maximum_speed in the same directory and the is_dualspeed line in the device file is superseded by the max_speed line in the same file.
The maximum_speed/max_speed value specifies the maximum speed supported by the UDC. To check if dualspeed is supported, check if the value is >= 3. The possible speeds are defined in <linux/usb/ch9.h>.
Who: Michal Nazarewicz

---------------------------

What: The XFS nodelaylog mount option
When: 3.3
Why: The delaylog mode that has been the default since 2.6.39 has proven stable, and the old code is in the way of additional improvements in the log code.
Who: Christoph Hellwig

---------------------------

What: iwlagn alias support
When: 3.5
Why: The iwlagn module has been renamed iwlwifi. The alias will be around for backward compatibility for several cycles and then dropped.
Who: Don Fry

---------------------------

What: pci_scan_bus_parented()
When: 3.5
Why: The pci_scan_bus_parented() interface creates a new root bus. The bus is created with default resources (ioport_resource and iomem_resource) that are always wrong, so we rely on arch code to correct them later. Callers of pci_scan_bus_parented() should convert to using pci_scan_root_bus() so they can supply a list of bus resources when the bus is created.
Who: Bjorn Helgaas

---------------------------

What: The CAP9 SoC family will be removed
When: 3.4
Files: arch/arm/mach-at91/at91cap9.c
       arch/arm/mach-at91/at91cap9_devices.c
       arch/arm/mach-at91/include/mach/at91cap9.h
       arch/arm/mach-at91/include/mach/at91cap9_matrix.h
       arch/arm/mach-at91/include/mach/at91cap9_ddrsdr.h
       arch/arm/mach-at91/board-cap9adk.c
Why: The code is not actively maintained and platforms are now hard to find.
Who: Nicolas Ferre
     Jean-Christophe PLAGNIOL-VILLARD

Using flexible arrays in the kernel
Last updated for 2.6.32
Jonathan Corbet

Large contiguous memory allocations can be unreliable in the Linux kernel. Kernel programmers will sometimes respond to this problem by allocating pages with vmalloc(). This solution is not ideal, though. On 32-bit systems, memory from vmalloc() must be mapped into a relatively small address space; it's easy to run out.
On SMP systems, the page table changes required by vmalloc() allocations can require expensive cross-processor interrupts on all CPUs. And, on all systems, use of space in the vmalloc() range increases pressure on the translation lookaside buffer (TLB), reducing the performance of the system.

In many cases, the need for memory from vmalloc() can be eliminated by piecing together an array from smaller parts; the flexible array library exists to make this task easier.

A flexible array holds an arbitrary (within limits) number of fixed-sized objects, accessed via an integer index. Sparse arrays are handled reasonably well. Only single-page allocations are made, so memory allocation failures should be relatively rare. The down sides are that the arrays cannot be indexed directly, individual object size cannot exceed the system page size, and putting data into a flexible array requires a copy operation. It's also worth noting that flexible arrays do no internal locking at all; if concurrent access to an array is possible, then the caller must arrange for appropriate mutual exclusion.

The creation of a flexible array is done with:

    #include <linux/flex_array.h>

    struct flex_array *flex_array_alloc(int element_size,
                                        unsigned int total,
                                        gfp_t flags);

The individual object size is provided by element_size, while total is the maximum number of objects which can be stored in the array. The flags argument is passed directly to the internal memory allocation calls. With the current code, using flags to ask for high memory is likely to lead to notably unpleasant side effects.

It is also possible to define flexible arrays at compile time with:

    DEFINE_FLEX_ARRAY(name, element_size, total);

This macro will result in a definition of an array with the given name; the element size and total will be checked for validity at compile time.
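To make the "cannot be indexed directly" restriction concrete, here is a small userspace sketch of how an element index might be translated into a page-sized part and an offset within it. It is a simplified model under stated assumptions (a 4096-byte page, elements packed back to back, no element straddling two parts); MODEL_PAGE_SIZE, struct model_slot and model_locate() are hypothetical names and are not part of the flexible array library.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

/* Where one element lives in the model. */
struct model_slot {
	unsigned int part;	/* index of the page-sized part */
	size_t offset;		/* byte offset of the element inside that part */
};

/*
 * Hypothetical helper: map an element number to (part, offset).
 * Each part holds as many whole elements as fit in one page.
 */
static struct model_slot model_locate(size_t element_size,
				      unsigned int element_nr)
{
	unsigned int per_part;
	struct model_slot slot;

	/* An element larger than a page cannot be stored at all. */
	assert(element_size > 0 && element_size <= MODEL_PAGE_SIZE);

	per_part = MODEL_PAGE_SIZE / element_size;
	slot.part = element_nr / per_part;
	slot.offset = (size_t)(element_nr % per_part) * element_size;
	return slot;
}
```

With 100-byte elements, for example, this model fits 40 elements per part, so element 40 is the first element of the second part. Because neighbouring elements may sit in different parts, a caller cannot compute an element's address with plain pointer arithmetic.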
Storing data into a flexible array is accomplished with a call to:

    int flex_array_put(struct flex_array *array, unsigned int element_nr,
                       void *src, gfp_t flags);

This call will copy the data from src into the array, in the position indicated by element_nr (which must be less than the maximum specified when the array was created). If any memory allocations must be performed, flags will be used. The return value is zero on success, a negative error code otherwise.

There may be a need to store data into a flexible array while running in some sort of atomic context; in this situation, sleeping in the memory allocator would be a bad thing. That can be avoided by using GFP_ATOMIC for the flags value, but, often, there is a better way. The trick is to ensure that any needed memory allocations are done before entering atomic context, using:

    int flex_array_prealloc(struct flex_array *array, unsigned int start,
                            unsigned int nr_elements, gfp_t flags);

This function will ensure that memory for the elements indexed in the range defined by start and nr_elements has been allocated. Thereafter, a flex_array_put() call on an element in that range is guaranteed not to block.

Getting data back out of the array is done with:

    void *flex_array_get(struct flex_array *fa, unsigned int element_nr);

The return value is a pointer to the data element, or NULL if that particular element has never been allocated.

Note that it is possible to get back a valid pointer for an element which has never been stored in the array. Memory for array elements is allocated one page at a time; a single allocation could provide memory for several adjacent elements. Flexible array elements are normally initialized to the value FLEX_ARRAY_FREE (defined as 0x6c in <linux/poison.h>), so errors involving that number probably result from use of unstored array entries. Note that, if array elements are allocated with __GFP_ZERO, they will be initialized to zero and this poisoning will not happen.
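The put/get behaviour described above, including the poisoned neighbours of a stored element, can be modelled in userspace. The sketch below is not the kernel implementation: it assumes a 4096-byte page, uses malloc() in place of the page allocator, caps the number of parts at an arbitrary 64, and every toy_* name is hypothetical. Only the 0x6c poison value is taken from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size */
#define TOY_POISON    0x6c	/* byte value given for FLEX_ARRAY_FREE */
#define TOY_MAX_PARTS 64u	/* arbitrary limit chosen for this sketch */

struct toy_flex_array {
	size_t element_size;
	unsigned int total;
	unsigned char *parts[TOY_MAX_PARTS];	/* allocated on first put */
};

static unsigned int toy_per_part(const struct toy_flex_array *fa)
{
	return TOY_PAGE_SIZE / fa->element_size;
}

static struct toy_flex_array *toy_alloc(size_t element_size,
					unsigned int total)
{
	struct toy_flex_array *fa;

	if (element_size == 0 || element_size > TOY_PAGE_SIZE)
		return NULL;
	if (total > TOY_MAX_PARTS * (TOY_PAGE_SIZE / element_size))
		return NULL;
	fa = calloc(1, sizeof(*fa));	/* all part pointers start NULL */
	if (!fa)
		return NULL;
	fa->element_size = element_size;
	fa->total = total;
	return fa;
}

/* Copy *src into slot nr, allocating and poisoning its part if needed. */
static int toy_put(struct toy_flex_array *fa, unsigned int nr,
		   const void *src)
{
	unsigned int part;

	if (nr >= fa->total)
		return -ENOSPC;
	part = nr / toy_per_part(fa);
	if (!fa->parts[part]) {
		fa->parts[part] = malloc(TOY_PAGE_SIZE);
		if (!fa->parts[part])
			return -ENOMEM;
		memset(fa->parts[part], TOY_POISON, TOY_PAGE_SIZE);
	}
	memcpy(fa->parts[part] + (nr % toy_per_part(fa)) * fa->element_size,
	       src, fa->element_size);
	return 0;
}

/* Return a pointer to slot nr, or NULL if its part was never allocated. */
static void *toy_get(struct toy_flex_array *fa, unsigned int nr)
{
	unsigned int part;

	if (nr >= fa->total)
		return NULL;
	part = nr / toy_per_part(fa);
	if (!fa->parts[part])
		return NULL;
	return fa->parts[part] + (nr % toy_per_part(fa)) * fa->element_size;
}

static void toy_free(struct toy_flex_array *fa)
{
	unsigned int i;

	for (i = 0; i < TOY_MAX_PARTS; i++)
		free(fa->parts[i]);
	free(fa);
}
```

In this model, storing element 3000 of an int array allocates only the part that contains it. A later toy_get() of element 3001 returns a valid pointer to poison bytes, while toy_get() of element 0 returns NULL because its part was never allocated.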
Individual elements in the array can be cleared with:

    int flex_array_clear(struct flex_array *array, unsigned int element_nr);

This function will set the given element to FLEX_ARRAY_FREE and return zero. If storage for the indicated element is not allocated for the array, flex_array_clear() will return -EINVAL instead. Note that clearing an element does not release the storage associated with it; to reduce the allocated size of an array, call:

    int flex_array_shrink(struct flex_array *array);

The return value will be the number of pages of memory actually freed. This function works by scanning the array for pages containing nothing but FLEX_ARRAY_FREE bytes, so (1) it can be expensive, and (2) it will not work if the array's pages are allocated with __GFP_ZERO.

It is possible to remove all elements of an array with a call to:

    void flex_array_free_parts(struct flex_array *array);

This call frees all elements, but leaves the array itself in place. Freeing the entire array is done with:

    void flex_array_free(struct flex_array *array);

As of this writing, there are no users of flexible arrays in the mainline kernel. The functions described here are also not exported to modules; that will probably be fixed when somebody comes up with a need for it.

Futex Requeue PI
----------------

Requeueing of tasks from a non-PI futex to a PI futex requires special handling in order to ensure the underlying rt_mutex is never left without an owner if it has waiters; doing so would break the PI boosting logic [see rt-mutex-design.txt]. For the purposes of brevity, this action will be referred to as "requeue_pi" throughout this document. Priority inheritance is abbreviated throughout as "PI".

Motivation
----------

Without requeue_pi, the glibc implementation of pthread_cond_broadcast() must resort to waking all the tasks waiting on a pthread_condvar and letting them try to sort out which task gets to run first in classic thundering-herd formation.
An ideal implementation would wake the highest-priority waiter, and leave the rest to the natural wakeup inherent in unlocking the mutex associated with the condvar.

Consider the simplified glibc calls:

    /* caller must lock mutex */
    pthread_cond_wait(cond, mutex)
    {
        lock(cond->__data.__lock);
        unlock(mutex);
        do {
            unlock(cond->__data.__lock);
            futex_wait(cond->__data.__futex);
            lock(cond->__data.__lock);
        } while(...)
        unlock(cond->__data.__lock);
        lock(mutex);
    }

    pthread_cond_broadcast(cond)
    {
        lock(cond->__data.__lock);
        unlock(cond->__data.__lock);
        futex_requeue(cond->__data.__futex, cond->mutex);
    }

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex has waiters. Note that pthread_cond_wait() attempts to lock the mutex only after it has returned to user space. This will leave the underlying rt_mutex with waiters, and no owner, breaking the previously mentioned PI-boosting algorithms.

In order to support PI-aware pthread_condvar's, the kernel needs to be able to requeue tasks to PI futexes. This support implies that upon a successful futex_wait system call, the caller would return to user space already holding the PI futex. The glibc implementation would be modified as follows:

    /* caller must lock mutex */
    pthread_cond_wait_pi(cond, mutex)
    {
        lock(cond->__data.__lock);
        unlock(mutex);
        do {
            unlock(cond->__data.__lock);
            futex_wait_requeue_pi(cond->__data.__futex);
            lock(cond->__data.__lock);
        } while(...)
        unlock(cond->__data.__lock);
        /* the kernel acquired the mutex for us */
    }

    pthread_cond_broadcast_pi(cond)
    {
        lock(cond->__data.__lock);
        unlock(cond->__data.__lock);
        futex_requeue_pi(cond->__data.__futex, cond->mutex);
    }

The actual glibc implementation will likely test for PI and make the necessary changes inside the existing calls rather than creating new calls for the PI cases. Similar changes are needed for pthread_cond_timedwait() and pthread_cond_signal().
Implementation
--------------

In order to ensure the rt_mutex has an owner if it has waiters, it is necessary for both the requeue code, as well as the waiting code, to be able to acquire the rt_mutex before returning to user space. The requeue code cannot simply wake the waiter and leave it to acquire the rt_mutex as it would open a race window between the requeue call returning to user space and the waiter waking and starting to run. This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines, rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which allow the requeue code to acquire an uncontended rt_mutex on behalf of the waiter and to enqueue the waiter on a contended rt_mutex. Two new system calls provide the kernel<->user interface to requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() and pthread_cond_timedwait()) to block on the initial futex and wait to be requeued to a PI-aware futex. The implementation is the result of a high-speed collision between futex_wait() and futex_lock_pi(), with some extra logic to check for the additional wake-up scenarios.

FUTEX_CMP_REQUEUE_PI is called by the waker (pthread_cond_broadcast() and pthread_cond_signal()) to requeue and possibly wake the waiting tasks. Internally, this system call is still handled by futex_requeue (by passing requeue_pi=1). Before requeueing, futex_requeue() attempts to acquire the requeue target PI futex on behalf of the top waiter. If it can, this waiter is woken. futex_requeue() then proceeds to requeue the remaining nr_wake+nr_requeue tasks to the PI futex, calling rt_mutex_start_proxy_lock() prior to each requeue to prepare the task as a waiter on the underlying rt_mutex. It is possible that the lock can be acquired at this stage as well; if so, the next waiter is woken to finish the acquisition of the lock.
FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but their sum is all that really matters. futex_requeue() will wake or requeue up to nr_wake + nr_requeue tasks. It will wake only as many tasks as it can acquire the lock for, which in the majority of cases should be 0 as good programming practice dictates that the caller of either pthread_cond_broadcast() or pthread_cond_signal() acquire the mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for signal.

Using gcov with the Linux kernel
================================

1. Introduction
2. Preparation
3. Customization
4. Files
5. Modules
6. Separated build and test machines
7. Troubleshooting
Appendix A: sample script: gather_on_build.sh
Appendix B: sample script: gather_on_test.sh


1. Introduction
===============

gcov profiling kernel support enables the use of GCC's coverage testing tool gcov [1] with the Linux kernel. Coverage data of a running kernel is exported in gcov-compatible format via the "gcov" debugfs directory. To get coverage data for a specific file, change to the kernel build directory and use gcov with the -o option as follows (requires root):

    # cd /tmp/linux-out
    # gcov -o /sys/kernel/debug/gcov/tmp/linux-out/kernel spinlock.c

This will create source code files annotated with execution counts in the current directory. In addition, graphical gcov front-ends such as lcov [2] can be used to automate the process of collecting data for the entire kernel and provide coverage overviews in HTML format.

Possible uses:

* debugging (has this line been reached at all?)
* test improvement (how do I change my test to cover these lines?)
* minimizing kernel configurations (do I need this option if the associated code is never run?)

--

[1] http://gcc.gnu.org/onlinedocs/gcc/Gcov.html
[2] http://ltp.sourceforge.net/coverage/lcov.php


2.
Preparation
==============

Configure the kernel with:

    CONFIG_DEBUG_FS=y
    CONFIG_GCOV_KERNEL=y

and to get coverage data for the entire kernel:

    CONFIG_GCOV_PROFILE_ALL=y

Note that kernels compiled with profiling flags will be significantly larger and run slower. Also CONFIG_GCOV_PROFILE_ALL may not be supported on all architectures.

Profiling data will only become accessible once debugfs has been mounted:

    mount -t debugfs none /sys/kernel/debug


3. Customization
================

To enable profiling for specific files or directories, add a line similar to the following to the respective kernel Makefile:

    For a single file (e.g. main.o):
        GCOV_PROFILE_main.o := y

    For all files in one directory:
        GCOV_PROFILE := y

To exclude files from being profiled even when CONFIG_GCOV_PROFILE_ALL is specified, use:

        GCOV_PROFILE_main.o := n
    and:
        GCOV_PROFILE := n

Only files which are linked to the main kernel image or are compiled as kernel modules are supported by this mechanism.


4. Files
========

The gcov kernel support creates the following files in debugfs:

    /sys/kernel/debug/gcov
        Parent directory for all gcov-related files.

    /sys/kernel/debug/gcov/reset
        Global reset file: resets all coverage data to zero when written to.

    /sys/kernel/debug/gcov/path/to/compile/dir/file.gcda
        The actual gcov data file as understood by the gcov tool. Resets file coverage data to zero when written to.

    /sys/kernel/debug/gcov/path/to/compile/dir/file.gcno
        Symbolic link to a static data file required by the gcov tool. This file is generated by gcc when compiling with option -ftest-coverage.


5. Modules
==========

Kernel modules may contain cleanup code which is only run during module unload time. The gcov mechanism provides a means to collect coverage data for such code by keeping a copy of the data associated with the unloaded module. This data remains available through debugfs. Once the module is loaded again, the associated coverage counters are initialized with the data from its previous instantiation.
This behavior can be deactivated by specifying the gcov_persist kernel
parameter:

	gcov_persist=0

At run-time, a user can also choose to discard data for an unloaded
module by writing to its data file or the global reset file.


6. Separated build and test machines
====================================

The gcov kernel profiling infrastructure is designed to work
out-of-the-box for setups where kernels are built and run on the same
machine. In cases where the kernel runs on a separate machine, special
preparations must be made, depending on where the gcov tool is used:

a) gcov is run on the TEST machine

The gcov tool version on the test machine must be compatible with the
gcc version used for kernel build. Also the following files need to be
copied from build to test machine:

from the source tree:
  - all C source files + headers

from the build tree:
  - all C source files + headers
  - all .gcda and .gcno files
  - all links to directories

It is important to note that these files need to be placed into the
exact same file system location on the test machine as on the build
machine. If any of the path components is a symbolic link, the actual
directory needs to be used instead (due to make's CURDIR handling).

b) gcov is run on the BUILD machine

The following files need to be copied after each test case from test
to build machine:

from the gcov directory in debugfs:
  - all .gcda files
  - all links to .gcno files

These files can be copied to any location on the build machine. gcov
must then be called with the -o option pointing to that directory.

Example directory setup on the build machine:

  /tmp/linux:    kernel source tree
  /tmp/out:      kernel build directory as specified by make O=
  /tmp/coverage: location of the files copied from the test machine

  [user@build] cd /tmp/out
  [user@build] gcov -o /tmp/coverage/tmp/out/init main.c


7. Troubleshooting
==================

Problem:  Compilation aborts during linker step.
Cause:    Profiling flags are specified for source files which are not
          linked to the main kernel or which are linked by a custom
          linker procedure.
Solution: Exclude affected source files from profiling by specifying
          GCOV_PROFILE := n or GCOV_PROFILE_basename.o := n in the
          corresponding Makefile.

Problem:  Files copied from debugfs appear empty or incomplete.
Cause:    Due to the way seq_file works, some tools such as cp or tar
          may not correctly copy files from debugfs.
Solution: Use 'cat' to read .gcda files and 'cp -d' to copy links.
          Alternatively use the mechanism shown in Appendix B.


Appendix A: gather_on_build.sh
==============================

Sample script to gather coverage meta files on the build machine
(see 6a):

#!/bin/bash

KSRC=$1
KOBJ=$2
DEST=$3

if [ -z "$KSRC" ] || [ -z "$KOBJ" ] || [ -z "$DEST" ]; then
  echo "Usage: $0 <ksrc directory> <kobj directory> <output.tar.gz>" >&2
  exit 1
fi

# Resolve both directories the way make does (CURDIR), see section 6a
KSRC=$(cd $KSRC; printf "all:\n\t@echo \${CURDIR}\n" | make -f -)
KOBJ=$(cd $KOBJ; printf "all:\n\t@echo \${CURDIR}\n" | make -f -)

find $KSRC $KOBJ \( -name '*.gcno' -o -name '*.[ch]' -o -type l \) -a \
                 -perm /u+r,g+r | tar cfz $DEST -P -T -

if [ $? -eq 0 ] ; then
  echo "$DEST successfully created, copy to test system and unpack with:"
  echo "  tar xfz $DEST -P"
else
  echo "Could not create file $DEST"
fi


Appendix B: gather_on_test.sh
=============================

Sample script to gather coverage data files on the test machine
(see 6b):

#!/bin/bash -e

DEST=$1
GCDA=/sys/kernel/debug/gcov

if [ -z "$DEST" ] ; then
  echo "Usage: $0 <output.tar.gz>" >&2
  exit 1
fi

TEMPDIR=$(mktemp -d)
echo Collecting data..
# Recreate the directory tree, then read data with cat and copy links with cp -d
find $GCDA -type d -exec mkdir -p $TEMPDIR/\{\} \;
find $GCDA -name '*.gcda' -exec sh -c 'cat < $0 > '$TEMPDIR'/$0' {} \;
find $GCDA -name '*.gcno' -exec sh -c 'cp -d $0 '$TEMPDIR'/$0' {} \;
tar czf $DEST -C $TEMPDIR sys
rm -rf $TEMPDIR

echo "$DEST successfully created, copy to build system and unpack with:"
echo "  tar xfz $DEST"


GPIO Interfaces

This provides an overview of GPIO access conventions on Linux.

These calls use the gpio_* naming prefix.
No other calls should use that prefix, or the related __gpio_* prefix. What is a GPIO? =============== A "General Purpose Input/Output" (GPIO) is a flexible software-controlled digital signal. They are provided from many kinds of chip, and are familiar to Linux developers working with embedded and custom hardware. Each GPIO represents a bit connected to a particular pin, or "ball" on Ball Grid Array (BGA) packages. Board schematics show which external hardware connects to which GPIOs. Drivers can be written generically, so that board setup code passes such pin configuration data to drivers. System-on-Chip (SOC) processors heavily rely on GPIOs. In some cases, every non-dedicated pin can be configured as a GPIO; and most chips have at least several dozen of them. Programmable logic devices (like FPGAs) can easily provide GPIOs; multifunction chips like power managers, and audio codecs often have a few such pins to help with pin scarcity on SOCs; and there are also "GPIO Expander" chips that connect using the I2C or SPI serial busses. Most PC southbridges have a few dozen GPIO-capable pins (with only the BIOS firmware knowing how they're used). The exact capabilities of GPIOs vary between systems. Common options: - Output values are writable (high=1, low=0). Some chips also have options about how that value is driven, so that for example only one value might be driven ... supporting "wire-OR" and similar schemes for the other value (notably, "open drain" signaling). - Input values are likewise readable (1, 0). Some chips support readback of pins configured as "output", which is very useful in such "wire-OR" cases (to support bidirectional signaling). GPIO controllers may have input de-glitch/debounce logic, sometimes with software controls. - Inputs can often be used as IRQ signals, often edge triggered but sometimes level triggered. Such IRQs may be configurable as system wakeup events, to wake the system from a low power state. 
  - Usually a GPIO will be configurable as either input or output, as needed
    by different product boards; single direction ones exist too.

  - Most GPIOs can be accessed while holding spinlocks, but those accessed
    through a serial bus normally can't.  Some systems support both types.

On a given board each GPIO is used for one specific purpose like monitoring
MMC/SD card insertion/removal, detecting card writeprotect status, driving
an LED, configuring a transceiver, bitbanging a serial bus, poking a hardware
watchdog, sensing a switch, and so on.


GPIO conventions
================
Note that this is called a "convention" because you don't need to do it this
way, and it's no crime if you don't.  There **are** cases where portability
is not the main issue; GPIOs are often used for the kind of board-specific
glue logic that may even change between board revisions, and can't ever be
used on a board that's wired differently.  Only least-common-denominator
functionality can be very portable.  Other features are platform-specific,
and that can be critical for glue logic.

Plus, this doesn't require any implementation framework, just an interface.
One platform might implement it as simple inline functions accessing chip
registers; another might implement it by delegating through abstractions
used for several very different kinds of GPIO controller.  (There is some
optional code supporting such an implementation strategy, described later
in this document, but drivers acting as clients to the GPIO interface must
not care how it's implemented.)

That said, if the convention is supported on their platform, drivers should
use it when possible.  Platforms must declare GENERIC_GPIO support in their
Kconfig (boolean true), and provide an <asm/gpio.h> file.  Drivers that can't
work without standard GPIO calls should have Kconfig entries which depend
on GENERIC_GPIO.
The GPIO calls are available, either as "real code" or as optimized-away
stubs, when drivers use the include file:

	#include <linux/gpio.h>

If you stick to this convention then it'll be easier for other developers to
see what your code is doing, and help maintain it.

Note that these operations include I/O barriers on platforms which need to
use them; drivers don't need to add them explicitly.


Identifying GPIOs
-----------------
GPIOs are identified by unsigned integers in the range 0..MAX_INT.  That
reserves "negative" numbers for other purposes like marking signals as
"not available on this board", or indicating faults.  Code that doesn't
touch the underlying hardware treats these integers as opaque cookies.

Platforms define how they use those integers, and usually #define symbols
for the GPIO lines so that board-specific setup code directly corresponds
to the relevant schematics.  In contrast, drivers should only use GPIO
numbers passed to them from that setup code, using platform_data to hold
board-specific pin configuration data (along with other board specific
data they need).  That avoids portability problems.

So for example one platform uses numbers 32-159 for GPIOs; while another
uses numbers 0..63 with one set of GPIO controllers, 64-79 with another
type of GPIO controller, and on one particular board 80-95 with an FPGA.
The numbers need not be contiguous; either of those platforms could also
use numbers 2000-2063 to identify GPIOs in a bank of I2C GPIO expanders.

If you want to initialize a structure with an invalid GPIO number, use
some negative number (perhaps "-EINVAL"); that will never be valid.  To
test whether a number from such a structure could reference a GPIO, you
may use this predicate:

	int gpio_is_valid(int number);

A number that's not valid will be rejected by calls which may request
or free GPIOs (see below).  Other numbers may also be rejected; for
example, a number might be valid but temporarily unused on a given board.
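
As a hedged illustration of the negative-number convention, the userspace
sketch below models board data in which an unconnected signal is marked
with -EINVAL.  It is not kernel code: model_gpio_is_valid(), MODEL_NR_GPIOS
and struct card_slot_pins are hypothetical names invented for this example,
and the range check a real platform applies is platform-defined.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumption: this imaginary platform numbers its GPIOs 0..255. */
#define MODEL_NR_GPIOS 256

/* Hypothetical board data, in the spirit of platform_data. */
struct card_slot_pins {
	int gpio_detect;	/* card-detect input, negative if not wired */
	int gpio_wp;		/* write-protect input, negative if not wired */
};

/* Userspace stand-in for the gpio_is_valid() predicate. */
static bool model_gpio_is_valid(int number)
{
	return number >= 0 && number < MODEL_NR_GPIOS;
}

/* Count how many of the optional signals this board actually provides. */
static int count_usable_pins(const struct card_slot_pins *pins)
{
	int n = 0;

	if (model_gpio_is_valid(pins->gpio_detect))
		n++;
	if (model_gpio_is_valid(pins->gpio_wp))
		n++;
	return n;
}
```

A driver following this pattern would skip any feature whose GPIO number
fails the predicate instead of treating the missing signal as an error.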
Whether a platform supports multiple GPIO controllers is a platform-specific implementation issue, as are whether that support can leave "holes" in the space of GPIO numbers, and whether new controllers can be added at runtime. Such issues can affect things including whether adjacent GPIO numbers are both valid. Using GPIOs ----------- The first thing a system should do with a GPIO is allocate it, using the gpio_request() call; see later. One of the next things to do with a GPIO, often in board setup code when setting up a platform_device using the GPIO, is mark its direction: /* set as input or output, returning 0 or negative errno */ int gpio_direction_input(unsigned gpio); int gpio_direction_output(unsigned gpio, int value); The return value is zero for success, else a negative errno. It should be checked, since the get/set calls don't have error returns and since misconfiguration is possible. You should normally issue these calls from a task context. However, for spinlock-safe GPIOs it's OK to use them before tasking is enabled, as part of early board setup. For output GPIOs, the value provided becomes the initial output value. This helps avoid signal glitching during system startup. For compatibility with legacy interfaces to GPIOs, setting the direction of a GPIO implicitly requests that GPIO (see below) if it has not been requested already. That compatibility is being removed from the optional gpiolib framework. Setting the direction can fail if the GPIO number is invalid, or when that particular GPIO can't be used in that mode. It's generally a bad idea to rely on boot firmware to have set the direction correctly, since it probably wasn't validated to do more than boot Linux. (Similarly, that board setup code probably needs to multiplex that pin as a GPIO, and configure pullups/pulldowns appropriately.) Spinlock-Safe GPIO access ------------------------- Most GPIO controllers can be accessed with memory read/write instructions. 
Those don't need to sleep, and can safely be done from inside hard (nonthreaded) IRQ handlers and similar contexts. Use the following calls to access such GPIOs, for which gpio_cansleep() will always return false (see below): /* GPIO INPUT: return zero or nonzero */ int gpio_get_value(unsigned gpio); /* GPIO OUTPUT */ void gpio_set_value(unsigned gpio, int value); The values are boolean, zero for low, nonzero for high. When reading the value of an output pin, the value returned should be what's seen on the pin ... that won't always match the specified output value, because of issues including open-drain signaling and output latencies. The get/set calls have no error returns because "invalid GPIO" should have been reported earlier from gpio_direction_*(). However, note that not all platforms can read the value of output pins; those that can't should always return zero. Also, using these calls for GPIOs that can't safely be accessed without sleeping (see below) is an error. Platform-specific implementations are encouraged to optimize the two calls to access the GPIO value in cases where the GPIO number (and for output, value) are constant. It's normal for them to need only a couple of instructions in such cases (reading or writing a hardware register), and not to need spinlocks. Such optimized calls can make bitbanging applications a lot more efficient (in both space and time) than spending dozens of instructions on subroutine calls. GPIO access that may sleep -------------------------- Some GPIO controllers must be accessed using message based busses like I2C or SPI. Commands to read or write those GPIO values require waiting to get to the head of a queue to transmit a command and get its response. This requires sleeping, which can't be done from inside IRQ handlers. 
Platforms that support this type of GPIO distinguish them from other GPIOs by returning nonzero from this call (which requires a valid GPIO number, which should have been previously allocated with gpio_request): int gpio_cansleep(unsigned gpio); To access such GPIOs, a different set of accessors is defined: /* GPIO INPUT: return zero or nonzero, might sleep */ int gpio_get_value_cansleep(unsigned gpio); /* GPIO OUTPUT, might sleep */ void gpio_set_value_cansleep(unsigned gpio, int value); Accessing such GPIOs requires a context which may sleep, for example a threaded IRQ handler, and those accessors must be used instead of spinlock-safe accessors without the cansleep() name suffix. Other than the fact that these accessors might sleep, and will work on GPIOs that can't be accessed from hardIRQ handlers, these calls act the same as the spinlock-safe calls. ** IN ADDITION ** calls to setup and configure such GPIOs must be made from contexts which may sleep, since they may need to access the GPIO controller chip too: (These setup calls are usually made from board setup or driver probe/teardown code, so this is an easy constraint.) gpio_direction_input() gpio_direction_output() gpio_request() ## gpio_request_one() ## gpio_request_array() ## gpio_free_array() gpio_free() gpio_set_debounce() Claiming and Releasing GPIOs ---------------------------- To help catch system configuration errors, two calls are defined. /* request GPIO, returning 0 or negative errno. * non-null labels may be useful for diagnostics. */ int gpio_request(unsigned gpio, const char *label); /* release previously-claimed GPIO */ void gpio_free(unsigned gpio); Passing invalid GPIO numbers to gpio_request() will fail, as will requesting GPIOs that have already been claimed with that call. The return value of gpio_request() must be checked. You should normally issue these calls from a task context. 
However, for spinlock-safe GPIOs it's OK to request GPIOs before tasking is enabled, as part of early board setup. These calls serve two basic purposes. One is marking the signals which are actually in use as GPIOs, for better diagnostics; systems may have several hundred potential GPIOs, but often only a dozen are used on any given board. Another is to catch conflicts, identifying errors when (a) two or more drivers wrongly think they have exclusive use of that signal, or (b) something wrongly believes it's safe to remove drivers needed to manage a signal that's in active use. That is, requesting a GPIO can serve as a kind of lock. Some platforms may also use knowledge about what GPIOs are active for power management, such as by powering down unused chip sectors and, more easily, gating off unused clocks. Note that requesting a GPIO does NOT cause it to be configured in any way; it just marks that GPIO as in use. Separate code must handle any pin setup (e.g. controlling which pin the GPIO uses, pullup/pulldown). Also note that it's your responsibility to have stopped using a GPIO before you free it. 
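
The "kind of lock" behaviour described above can be sketched with a small
userspace model.  This is only an illustration under stated assumptions:
model_gpio_request(), model_gpio_free() and MODEL_NR_GPIOS are hypothetical
names, and the particular errno values (-EINVAL, -EBUSY) are this sketch's
choice; the text above only promises "0 or negative errno".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumption: a small fixed GPIO number space for the model. */
#define MODEL_NR_GPIOS 64

/* Label of the current owner of each GPIO, or NULL when unclaimed. */
static const char *model_owner[MODEL_NR_GPIOS];

/* Claim a GPIO; fails if the number is invalid or already claimed. */
static int model_gpio_request(unsigned gpio, const char *label)
{
	if (gpio >= MODEL_NR_GPIOS)
		return -EINVAL;
	if (model_owner[gpio])
		return -EBUSY;
	/* Requesting only marks the GPIO as in use; it configures nothing. */
	model_owner[gpio] = label ? label : "?";
	return 0;
}

/* Release a previously claimed GPIO. */
static void model_gpio_free(unsigned gpio)
{
	if (gpio < MODEL_NR_GPIOS)
		model_owner[gpio] = NULL;
}
```

In this model a second request for the same number fails until the first
owner frees it, which is how conflicting drivers would be detected.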
Considering in most cases GPIOs are actually configured right after they
are claimed, three additional calls are defined:

	/* request a single GPIO, with initial configuration specified by
	 * 'flags', identical to gpio_request() wrt other arguments and
	 * return value
	 */
	int gpio_request_one(unsigned gpio, unsigned long flags, const char *label);

	/* request multiple GPIOs in a single call
	 */
	int gpio_request_array(struct gpio *array, size_t num);

	/* release multiple GPIOs in a single call
	 */
	void gpio_free_array(struct gpio *array, size_t num);

where 'flags' is currently defined to specify the following properties:

	* GPIOF_DIR_IN		- to configure direction as input
	* GPIOF_DIR_OUT		- to configure direction as output

	* GPIOF_INIT_LOW	- as output, set initial level to LOW
	* GPIOF_INIT_HIGH	- as output, set initial level to HIGH

Since GPIOF_INIT_* are only valid when the GPIO is configured as output,
the valid combinations are grouped as:

	* GPIOF_IN		- configure as input
	* GPIOF_OUT_INIT_LOW	- configured as output, initial level LOW
	* GPIOF_OUT_INIT_HIGH	- configured as output, initial level HIGH

In the future, these flags can be extended to support more properties such
as open-drain status.

Furthermore, to ease the claim/release of multiple GPIOs, 'struct gpio' is
introduced to encapsulate all three fields as:

	struct gpio {
		unsigned	gpio;
		unsigned long	flags;
		const char	*label;
	};

A typical example of usage:

	static struct gpio leds_gpios[] = {
		{ 32, GPIOF_OUT_INIT_HIGH, "Power LED" }, /* default to ON */
		{ 33, GPIOF_OUT_INIT_LOW,  "Green LED" }, /* default to OFF */
		{ 34, GPIOF_OUT_INIT_LOW,  "Red LED"   }, /* default to OFF */
		{ 35, GPIOF_OUT_INIT_LOW,  "Blue LED"  }, /* default to OFF */
		{ ... },
	};

	err = gpio_request_one(31, GPIOF_IN, "Reset Button");
	if (err)
		...

	err = gpio_request_array(leds_gpios, ARRAY_SIZE(leds_gpios));
	if (err)
		...

	gpio_free_array(leds_gpios, ARRAY_SIZE(leds_gpios));


GPIOs mapped to IRQs
--------------------
GPIO numbers are unsigned integers; so are IRQ numbers.
These make up two logically distinct namespaces (GPIO 0 need not use IRQ 0). You can map between them using calls like: /* map GPIO numbers to IRQ numbers */ int gpio_to_irq(unsigned gpio); /* map IRQ numbers to GPIO numbers (avoid using this) */ int irq_to_gpio(unsigned irq); Those return either the corresponding number in the other namespace, or else a negative errno code if the mapping can't be done. (For example, some GPIOs can't be used as IRQs.) It is an unchecked error to use a GPIO number that wasn't set up as an input using gpio_direction_input(), or to use an IRQ number that didn't originally come from gpio_to_irq(). These two mapping calls are expected to cost on the order of a single addition or subtraction. They're not allowed to sleep. Non-error values returned from gpio_to_irq() can be passed to request_irq() or free_irq(). They will often be stored into IRQ resources for platform devices, by the board-specific initialization code. Note that IRQ trigger options are part of the IRQ interface, e.g. IRQF_TRIGGER_FALLING, as are system wakeup capabilities. Non-error values returned from irq_to_gpio() would most commonly be used with gpio_get_value(), for example to initialize or update driver state when the IRQ is edge-triggered. Note that some platforms don't support this reverse mapping, so you should avoid using it. Emulating Open Drain Signals ---------------------------- Sometimes shared signals need to use "open drain" signaling, where only the low signal level is actually driven. (That term applies to CMOS transistors; "open collector" is used for TTL.) A pullup resistor causes the high signal level. This is sometimes called a "wire-AND"; or more practically, from the negative logic (low=true) perspective this is a "wire-OR". One common example of an open drain signal is a shared active-low IRQ line. Also, bidirectional data bus signals sometimes use open drain signals. Some GPIO controllers directly support open drain outputs; many don't. 
When you need open drain signaling but your hardware doesn't directly support it, there's a common idiom you can use to emulate it with any GPIO pin that can be used as either an input or an output: LOW: gpio_direction_output(gpio, 0) ... this drives the signal and overrides the pullup. HIGH: gpio_direction_input(gpio) ... this turns off the output, so the pullup (or some other device) controls the signal. If you are "driving" the signal high but gpio_get_value(gpio) reports a low value (after the appropriate rise time passes), you know some other component is driving the shared signal low. That's not necessarily an error. As one common example, that's how I2C clocks are stretched: a slave that needs a slower clock delays the rising edge of SCK, and the I2C master adjusts its signaling rate accordingly. What do these conventions omit? =============================== One of the biggest things these conventions omit is pin multiplexing, since this is highly chip-specific and nonportable. One platform might not need explicit multiplexing; another might have just two options for use of any given pin; another might have eight options per pin; another might be able to route a given GPIO to any one of several pins. (Yes, those examples all come from systems that run Linux today.) Related to multiplexing is configuration and enabling of the pullups or pulldowns integrated on some platforms. Not all platforms support them, or support them in the same way; and any given board might use external pullups (or pulldowns) so that the on-chip ones should not be used. (When a circuit needs 5 kOhm, on-chip 100 kOhm resistors won't do.) Likewise drive strength (2 mA vs 20 mA) and voltage (1.8V vs 3.3V) is a platform-specific issue, as are models like (not) having a one-to-one correspondence between configurable pins and GPIOs. There are other system-specific mechanisms that are not specified here, like the aforementioned options for input de-glitching and wire-OR output. 
Hardware may support reading or writing GPIOs in gangs, but that's usually configuration dependent: for GPIOs sharing the same bank. (GPIOs are commonly grouped in banks of 16 or 32, with a given SOC having several such banks.) Some systems can trigger IRQs from output GPIOs, or read values from pins not managed as GPIOs. Code relying on such mechanisms will necessarily be nonportable. Dynamic definition of GPIOs is not currently standard; for example, as a side effect of configuring an add-on board with some GPIO expanders. GPIO implementor's framework (OPTIONAL) ======================================= As noted earlier, there is an optional implementation framework making it easier for platforms to support different kinds of GPIO controller using the same programming interface. This framework is called "gpiolib". As a debugging aid, if debugfs is available a /sys/kernel/debug/gpio file will be found there. That will list all the controllers registered through this framework, and the state of the GPIOs currently in use. Controller Drivers: gpio_chip ----------------------------- In this framework each GPIO controller is packaged as a "struct gpio_chip" with information common to each controller of that type: - methods to establish GPIO direction - methods used to access GPIO values - flag saying whether calls to its methods may sleep - optional debugfs dump method (showing extra state like pullup config) - label for diagnostics There is also per-instance data, which may come from device.platform_data: the number of its first GPIO, and how many GPIOs it exposes. The code implementing a gpio_chip should support multiple instances of the controller, possibly using the driver model. That code will configure each gpio_chip and issue gpiochip_add(). Removing a GPIO controller should be rare; use gpiochip_remove() when it is unavoidable. 
Most often a gpio_chip is part of an instance-specific structure with state
not exposed by the GPIO interfaces, such as addressing, power management,
and more.  Chips such as codecs will have complex non-GPIO state.

Any debugfs dump method should normally ignore signals which haven't been
requested as GPIOs.  They can use gpiochip_is_requested(), which returns
either NULL or the label associated with that GPIO when it was requested.


Platform Support
----------------
To support this framework, a platform's Kconfig will "select" either
ARCH_REQUIRE_GPIOLIB or ARCH_WANT_OPTIONAL_GPIOLIB
and arrange that its <asm/gpio.h> includes <asm-generic/gpio.h> and defines
three functions: gpio_get_value(), gpio_set_value(), and gpio_cansleep().

It may also provide a custom value for ARCH_NR_GPIOS, so that it better
reflects the number of GPIOs in actual use on that platform, without
wasting static table space.  (It should count both built-in/SoC GPIOs and
also ones on GPIO expanders.)

ARCH_REQUIRE_GPIOLIB means that the gpiolib code will always get compiled
into the kernel on that architecture.

ARCH_WANT_OPTIONAL_GPIOLIB means the gpiolib code defaults to off and the user
can enable it and build it into the kernel optionally.

If neither of these options is selected, the platform does not support
GPIOs through GPIO-lib and the code cannot be enabled by the user.

Trivial implementations of those functions can directly use framework
code, which always dispatches through the gpio_chip:

  #define gpio_get_value	__gpio_get_value
  #define gpio_set_value	__gpio_set_value
  #define gpio_cansleep		__gpio_cansleep

Fancier implementations could instead define those as inline functions with
logic optimizing access to specific SOC-based GPIOs.  For example, if the
referenced GPIO is the constant "12", getting or setting its value could
cost as little as two or three instructions, never sleeping.  When such an
optimization is not possible those calls must delegate to the framework
code, costing at least a few dozen instructions.
For bitbanged I/O, such instruction savings can be significant. For SOCs, platform-specific code defines and registers gpio_chip instances for each bank of on-chip GPIOs. Those GPIOs should be numbered/labeled to match chip vendor documentation, and directly match board schematics. They may well start at zero and go up to a platform-specific limit. Such GPIOs are normally integrated into platform initialization to make them always be available, from arch_initcall() or earlier; they can often serve as IRQs. Board Support ------------- For external GPIO controllers -- such as I2C or SPI expanders, ASICs, multi function devices, FPGAs or CPLDs -- most often board-specific code handles registering controller devices and ensures that their drivers know what GPIO numbers to use with gpiochip_add(). Their numbers often start right after platform-specific GPIOs. For example, board setup code could create structures identifying the range of GPIOs that chip will expose, and passes them to each GPIO expander chip using platform_data. Then the chip driver's probe() routine could pass that data to gpiochip_add(). Initialization order can be important. For example, when a device relies on an I2C-based GPIO, its probe() routine should only be called after that GPIO becomes available. That may mean the device should not be registered until calls for that GPIO can work. One way to address such dependencies is for such gpio_chip controllers to provide setup() and teardown() callbacks to board specific code; those board specific callbacks would register devices once all the necessary resources are available, and remove them later when the GPIO controller device becomes unavailable. Sysfs Interface for Userspace (OPTIONAL) ======================================== Platforms which use the "gpiolib" implementors framework may choose to configure a sysfs user interface to GPIOs. 
This is different from the debugfs interface, since it provides control over GPIO direction and value instead of just showing a gpio state summary. Plus, it could be present on production systems without debugging support. Given appropriate hardware documentation for the system, userspace could know for example that GPIO #23 controls the write protect line used to protect boot loader segments in flash memory. System upgrade procedures may need to temporarily remove that protection, first importing a GPIO, then changing its output state, then updating the code before re-enabling the write protection. In normal use, GPIO #23 would never be touched, and the kernel would have no need to know about it. Again depending on appropriate hardware documentation, on some systems userspace GPIO can be used to determine system configuration data that standard kernels won't know about. And for some tasks, simple userspace GPIO drivers could be all that the system really needs. Note that standard kernel drivers exist for common "LEDs and Buttons" GPIO tasks: "leds-gpio" and "gpio_keys", respectively. Use those instead of talking directly to the GPIOs; they integrate with kernel frameworks better than your userspace code could. Paths in Sysfs -------------- There are three kinds of entry in /sys/class/gpio: - Control interfaces used to get userspace control over GPIOs; - GPIOs themselves; and - GPIO controllers ("gpio_chip" instances). That's in addition to standard files including the "device" symlink. The control interfaces are write-only: /sys/class/gpio/ "export" ... Userspace may ask the kernel to export control of a GPIO to userspace by writing its number to this file. Example: "echo 19 > export" will create a "gpio19" node for GPIO #19, if that's not requested by kernel code. "unexport" ... Reverses the effect of exporting to userspace. Example: "echo 19 > unexport" will remove a "gpio19" node exported using the "export" file. 
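
A minimal userspace sketch of driving the two control files from C follows.
The helper name gpio_sysfs_control() and its class_dir parameter are
assumptions made for this example (the directory is a parameter only so the
helper can be exercised against a scratch directory); on a real system the
directory would be /sys/class/gpio.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a GPIO number to the control file "file" ("export" or "unexport")
 * found under class_dir.  Returns 0 on success or a negative errno.
 */
static int gpio_sysfs_control(const char *class_dir, const char *file,
			      unsigned gpio)
{
	char path[256], num[16];
	ssize_t written;
	int fd, len, n;

	n = snprintf(path, sizeof(path), "%s/%s", class_dir, file);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;
	len = snprintf(num, sizeof(num), "%u", gpio);

	/* The control files are write-only, so open them that way. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	written = write(fd, num, (size_t)len);
	if (written < 0)
		n = -errno;
	else if (written != len)
		n = -EIO;
	else
		n = 0;

	close(fd);
	return n;
}
```

Calling gpio_sysfs_control("/sys/class/gpio", "export", 19) corresponds to
"echo 19 > export" above, and passing "unexport" reverses it.  A failure
from the write is one way userspace could learn that kernel code has
already requested that GPIO.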
GPIO signals have paths like /sys/class/gpio/gpio42/ (for GPIO #42)
and have the following read/write attributes:

    /sys/class/gpio/gpioN/

	"direction" ... reads as either "in" or "out".  This value may
		normally be written.  Writing as "out" defaults to
		initializing the value as low.  To ensure glitch free
		operation, values "low" and "high" may be written to
		configure the GPIO as an output with that initial value.

		Note that this attribute *will not exist* if the kernel
		doesn't support changing the direction of a GPIO, or
		it was exported by kernel code that didn't explicitly
		allow userspace to reconfigure this GPIO's direction.

	"value" ... reads as either 0 (low) or 1 (high).  If the GPIO
		is configured as an output, this value may be written;
		any nonzero value is treated as high.

		If the pin can be configured as an interrupt-generating
		input and if it has been configured to generate
		interrupts (see the description of "edge"), you can
		poll(2) on that file and poll(2) will return whenever
		the interrupt was triggered.  If you use poll(2), set
		the events POLLPRI and POLLERR.  If you use select(2),
		set the file descriptor in exceptfds.  After poll(2)
		returns, either lseek(2) to the beginning of the sysfs
		file and read the new value or close the file and
		re-open it to read the value.

	"edge" ... reads as either "none", "rising", "falling", or
		"both".  Write these strings to select the signal edge(s)
		that will make poll(2) on the "value" file return.

		This file exists only if the pin can be configured as an
		interrupt generating input pin.

	"active_low" ... reads as either 0 (false) or 1 (true).  Write
		any nonzero value to invert the value attribute both
		for reading and writing.  Existing and subsequent
		poll(2) support configuration via the edge attribute
		for "rising" and "falling" edges will follow this
		setting.
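
The poll(2) sequence described for the "value" attribute can be sketched
in userspace C roughly as follows.  The helper names gpio_value_read() and
gpio_value_wait() are made up for this example, and the sketch assumes the
caller has already exported the GPIO and written a suitable string to
"edge".

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Rewind an open "value" file and read it.
 * Returns 0 (low), 1 (high), or a negative errno.
 */
static int gpio_value_read(int fd)
{
	char c;
	ssize_t n;

	/* lseek(2) back to the start, as the text above requires. */
	if (lseek(fd, 0, SEEK_SET) < 0)
		return -errno;
	n = read(fd, &c, 1);
	if (n < 0)
		return -errno;
	if (n == 0)
		return -EIO;
	return c != '0';
}

/*
 * Wait for the configured edge using the POLLPRI and POLLERR events.
 * Returns 1 if an event was reported, 0 on timeout, or a negative errno.
 */
static int gpio_value_wait(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -errno;
	return ret > 0;
}
```

A caller would open /sys/class/gpio/gpioN/value read-only, then loop over
gpio_value_wait() followed by gpio_value_read().  Reading the value once
before the first wait is a common precaution, not something the text above
requires.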
GPIO controllers have paths like /sys/class/gpio/gpiochip42/ (for the
controller implementing GPIOs starting at #42) and have the following
read-only attributes:

    /sys/class/gpio/gpiochipN/

	"base" ... same as N, the first GPIO managed by this chip

	"label" ... provided for diagnostics (not always unique)

	"ngpio" ... how many GPIOs this manages (N to N + ngpio - 1)

Board documentation should in most cases cover what GPIOs are used for
what purposes.  However, those numbers are not always stable; GPIOs on
a daughtercard might be different depending on the base board being used,
or other cards in the stack.  In such cases, you may need to use the
gpiochip nodes (possibly in conjunction with schematics) to determine
the correct GPIO number to use for a given signal.


Exporting from Kernel code
--------------------------
Kernel code can explicitly manage exports of GPIOs which have already been
requested using gpio_request():

	/* export the GPIO to userspace */
	int gpio_export(unsigned gpio, bool direction_may_change);

	/* reverse gpio_export() */
	void gpio_unexport(unsigned gpio);

	/* create a sysfs link to an exported GPIO node */
	int gpio_export_link(struct device *dev, const char *name,
		unsigned gpio);

	/* change the polarity of a GPIO node in sysfs */
	int gpio_sysfs_set_active_low(unsigned gpio, int value);

After a kernel driver requests a GPIO, it may only be made available in
the sysfs interface by gpio_export().  The driver can control whether the
signal direction may change.  This helps drivers prevent userspace code
from accidentally clobbering important system state.

This explicit exporting can help with debugging (by making some kinds
of experiments easier), or can provide an always-there interface that's
suitable for documenting as part of a board support package.

After the GPIO has been exported, gpio_export_link() allows creating
symlinks from elsewhere in sysfs to the GPIO sysfs node.
Drivers can use this to provide the interface under their own device in sysfs with a descriptive name. Drivers can use gpio_sysfs_set_active_low() to hide GPIO line polarity differences between boards from user space. This only affects the sysfs interface. Polarity change can be done both before and after gpio_export(), and previously enabled poll(2) support for either rising or falling edge will be reconfigured to follow this setting. Notes on the change from 16-bit UIDs to 32-bit UIDs: - kernel code MUST take into account __kernel_uid_t and __kernel_uid32_t when communicating between user and kernel space in an ioctl or data structure. - kernel code should use uid_t and gid_t in kernel-private structures and code. What's left to be done for 32-bit UIDs on all Linux architectures: - Disk quotas have an interesting limitation that is not related to the maximum UID/GID. They are limited by the maximum file size on the underlying filesystem, because quota records are written at offsets corresponding to the UID in question. Further investigation is needed to see if the quota system can cope properly with huge UIDs. If it can deal with 64-bit file offsets on all architectures, this should not be a problem. - Decide whether or not to keep backwards compatibility with the system accounting file, or if we should break it as the comments suggest (currently, the old 16-bit UID and GID are still written to disk, and part of the former pad space is used to store separate 32-bit UID and GID) - Need to validate that OS emulation calls the 16-bit UID compatibility syscalls, if the OS being emulated used 16-bit UIDs, or uses the 32-bit UID system calls properly otherwise. This affects at least: iBCS on Intel sparc32 emulation on sparc64 (need to support whatever new 32-bit UID system calls are added to sparc32) - Validate that all filesystems behave properly. 
  At present, 32-bit UIDs _should_ work for:
        ext2
        ufs
        isofs
        nfs
        coda
        udf

  Ioctl() fixups have been made for:
        ncpfs
        smbfs

  Filesystems with simple fixups to prevent 16-bit UID wraparound:
        minix
        sysv
        qnx4

  Other filesystems have not been checked yet.

- The ncpfs and smbfs filesystems cannot presently use 32-bit UIDs in
  all ioctl()s. Some new ioctl()s have been added with 32-bit UIDs, but
  more are needed. (as well as new user<->kernel data structures)

- The ELF core dump format only supports 16-bit UIDs on arm, i386, m68k,
  sh, and sparc32. Fixing this is probably not that important, but would
  require adding a new ELF section.

- The ioctl()s used to control the in-kernel NFS server only support
  16-bit UIDs on arm, i386, m68k, sh, and sparc32.

- make sure that the UID mapping feature of AX25 networking works properly
  (it should be safe because it has always used a 32-bit integer to
  communicate between user and kernel)


Chris Wing
wingc@umich.edu

last updated: January 11, 2000

Introduction:

        The hw_random framework is software that makes use of a
        special hardware feature on your CPU or motherboard,
        a Random Number Generator (RNG). The software has two parts:
        a core providing the /dev/hw_random character device and its
        sysfs support, plus a hardware-specific driver that plugs
        into that core.

        To make the most effective use of these mechanisms, you
        should download the support software as well. Download the
        latest version of the "rng-tools" package from the
        hw_random driver's official Web site:

                http://sourceforge.net/projects/gkernel/

        Those tools use /dev/hw_random to fill the kernel entropy pool,
        which is used internally and exported by the /dev/urandom and
        /dev/random special files.

Theory of operation:

        CHARACTER DEVICE. Using the standard open()
        and read() system calls, you can read random data from
        the hardware RNG device. This data is NOT CHECKED by any
        fitness tests, and could potentially be bogus (if the
        hardware is faulty or has been tampered with).
        Data is only output if the hardware "has-data" flag is set,
        but nevertheless a security-conscious person would run
        fitness tests on the data before assuming it is truly random.

        The rng-tools package uses such tests in "rngd", and lets you
        run them by hand with a "rngtest" utility.

        /dev/hw_random is char device major 10, minor 183.

        CLASS DEVICE. There is a /sys/class/misc/hw_random node with
        two unique attributes, "rng_available" and "rng_current". The
        "rng_available" attribute lists the hardware-specific drivers
        available, while "rng_current" lists the one which is currently
        connected to /dev/hw_random. If your system has more than one
        RNG available, you may change the one used by writing a name from
        the list in "rng_available" into "rng_current".

==========================================================================

        Hardware driver for Intel/AMD/VIA Random Number Generators (RNG)
        Copyright 2000,2001 Jeff Garzik
        Copyright 2000,2001 Philipp Rumpf


About the Intel RNG hardware, from the firmware hub datasheet:

        The Firmware Hub integrates a Random Number Generator (RNG)
        using thermal noise generated from inherently random quantum
        mechanical properties of silicon. When not generating new random
        bits the RNG circuitry will enter a low power state. Intel will
        provide a binary software driver to give third party software
        access to our RNG for use as a security feature. At this time,
        the RNG is only to be used with a system in an OS-present state.

Intel RNG Driver notes:

        * FIXME: support poll(2)

        NOTE: request_mem_region was removed, for three reasons:
        1) Only one RNG is supported by this driver, 2) The location
        used by the RNG is a fixed location in MMIO-addressable memory,
        3) users with properly working BIOS e820 handling will always
        have the region in which the RNG is located reserved, so
        request_mem_region calls always fail for proper setups.
        However, for people who use mem=XX, BIOS e820 information is
        -not- in /proc/iomem, and request_mem_region(RNG_ADDR) can
        succeed.
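The class device attributes described above can be exercised from a shell. The following is a minimal sketch, assuming that "rng_available" holds whitespace-separated driver names; the function name and the RNG_CLASS_DIR variable are hypothetical, and RNG_CLASS_DIR only exists so the helper can be tried on a scratch directory.

```sh
#!/bin/sh
# RNG_CLASS_DIR is an assumption: it defaults to the sysfs node named in
# the text and can be redirected to a scratch tree for experiments.
RNG_CLASS_DIR=${RNG_CLASS_DIR:-/sys/class/misc/hw_random}

# rng_select NAME
# Make NAME the RNG connected to /dev/hw_random, but only if it is
# listed in "rng_available". Returns 1 and writes nothing otherwise.
rng_select() {
    for name in $(cat "$RNG_CLASS_DIR/rng_available"); do
        if [ "$name" = "$1" ]; then
            echo "$1" > "$RNG_CLASS_DIR/rng_current"
            return 0
        fi
    done
    echo "rng_select: '$1' is not listed in rng_available" >&2
    return 1
}
```

After a successful rng_select, data read from /dev/hw_random comes from the chosen driver; it is still unchecked, so the fitness-test advice above applies unchanged.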
Driver details:

        Based on:
        Intel 82802AB/82802AC Firmware Hub (FWH) Datasheet
                May 1999 Order Number: 290658-002 R

        Intel 82802 Firmware Hub: Random Number Generator
        Programmer's Reference Manual
                December 1999 Order Number: 298029-001 R

        Intel 82802 Firmware HUB Random Number Generator Driver
        Copyright (c) 2000 Matt Sottek

        Special thanks to Matt Sottek. I did the "guts", he
        did the "brains" and all the testing.

Hardware Spinlock Framework

1. Introduction

Hardware spinlock modules provide hardware assistance for synchronization
and mutual exclusion between heterogeneous processors and those not operating
under a single, shared operating system.

For example, OMAP4 has dual Cortex-A9, dual Cortex-M3 and a C64x+ DSP,
each of which is running a different Operating System (the master, A9,
is usually running Linux and the slave processors, the M3 and the DSP,
are running some flavor of RTOS).

A generic hwspinlock framework allows platform-independent drivers to use
the hwspinlock device in order to access data structures that are shared
between remote processors, that otherwise have no alternative mechanism
to accomplish synchronization and mutual exclusion operations.

This is necessary, for example, for Inter-processor communications:
on OMAP4, cpu-intensive multimedia tasks are offloaded by the host to the
remote M3 and/or C64x+ slave processors (by an IPC subsystem called Syslink).

To achieve fast message-based communications, minimal kernel support
is needed to deliver messages arriving from a remote processor to the
appropriate user process.

This communication is based on a simple data structure that is shared between
the remote processors, and access to it is synchronized using the hwspinlock
module (the remote processor directly places new messages in this shared data
structure).

A common hwspinlock interface makes it possible to have generic,
platform-independent drivers.

2.
User API struct hwspinlock *hwspin_lock_request(void); - dynamically assign an hwspinlock and return its address, or NULL in case an unused hwspinlock isn't available. Users of this API will usually want to communicate the lock's id to the remote core before it can be used to achieve synchronization. Should be called from a process context (might sleep). struct hwspinlock *hwspin_lock_request_specific(unsigned int id); - assign a specific hwspinlock id and return its address, or NULL if that hwspinlock is already in use. Usually board code will be calling this function in order to reserve specific hwspinlock ids for predefined purposes. Should be called from a process context (might sleep). int hwspin_lock_free(struct hwspinlock *hwlock); - free a previously-assigned hwspinlock; returns 0 on success, or an appropriate error code on failure (e.g. -EINVAL if the hwspinlock is already free). Should be called from a process context (might sleep). int hwspin_lock_timeout(struct hwspinlock *hwlock, unsigned int timeout); - lock a previously-assigned hwspinlock with a timeout limit (specified in msecs). If the hwspinlock is already taken, the function will busy loop waiting for it to be released, but give up when the timeout elapses. Upon a successful return from this function, preemption is disabled so the caller must not sleep, and is advised to release the hwspinlock as soon as possible, in order to minimize remote cores polling on the hardware interconnect. Returns 0 when successful and an appropriate error code otherwise (most notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs). The function will never sleep. int hwspin_lock_timeout_irq(struct hwspinlock *hwlock, unsigned int timeout); - lock a previously-assigned hwspinlock with a timeout limit (specified in msecs). If the hwspinlock is already taken, the function will busy loop waiting for it to be released, but give up when the timeout elapses. 
Upon a successful return from this function, preemption and the local interrupts are disabled, so the caller must not sleep, and is advised to release the hwspinlock as soon as possible. Returns 0 when successful and an appropriate error code otherwise (most notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs). The function will never sleep. int hwspin_lock_timeout_irqsave(struct hwspinlock *hwlock, unsigned int to, unsigned long *flags); - lock a previously-assigned hwspinlock with a timeout limit (specified in msecs). If the hwspinlock is already taken, the function will busy loop waiting for it to be released, but give up when the timeout elapses. Upon a successful return from this function, preemption is disabled, local interrupts are disabled and their previous state is saved at the given flags placeholder. The caller must not sleep, and is advised to release the hwspinlock as soon as possible. Returns 0 when successful and an appropriate error code otherwise (most notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs). The function will never sleep. int hwspin_trylock(struct hwspinlock *hwlock); - attempt to lock a previously-assigned hwspinlock, but immediately fail if it is already taken. Upon a successful return from this function, preemption is disabled so caller must not sleep, and is advised to release the hwspinlock as soon as possible, in order to minimize remote cores polling on the hardware interconnect. Returns 0 on success and an appropriate error code otherwise (most notably -EBUSY if the hwspinlock was already taken). The function will never sleep. int hwspin_trylock_irq(struct hwspinlock *hwlock); - attempt to lock a previously-assigned hwspinlock, but immediately fail if it is already taken. Upon a successful return from this function, preemption and the local interrupts are disabled so caller must not sleep, and is advised to release the hwspinlock as soon as possible. 
Returns 0 on success and an appropriate error code otherwise (most notably -EBUSY if the hwspinlock was already taken). The function will never sleep. int hwspin_trylock_irqsave(struct hwspinlock *hwlock, unsigned long *flags); - attempt to lock a previously-assigned hwspinlock, but immediately fail if it is already taken. Upon a successful return from this function, preemption is disabled, the local interrupts are disabled and their previous state is saved at the given flags placeholder. The caller must not sleep, and is advised to release the hwspinlock as soon as possible. Returns 0 on success and an appropriate error code otherwise (most notably -EBUSY if the hwspinlock was already taken). The function will never sleep. void hwspin_unlock(struct hwspinlock *hwlock); - unlock a previously-locked hwspinlock. Always succeed, and can be called from any context (the function never sleeps). Note: code should _never_ unlock an hwspinlock which is already unlocked (there is no protection against this). void hwspin_unlock_irq(struct hwspinlock *hwlock); - unlock a previously-locked hwspinlock and enable local interrupts. The caller should _never_ unlock an hwspinlock which is already unlocked. Doing so is considered a bug (there is no protection against this). Upon a successful return from this function, preemption and local interrupts are enabled. This function will never sleep. void hwspin_unlock_irqrestore(struct hwspinlock *hwlock, unsigned long *flags); - unlock a previously-locked hwspinlock. The caller should _never_ unlock an hwspinlock which is already unlocked. Doing so is considered a bug (there is no protection against this). Upon a successful return from this function, preemption is reenabled, and the state of the local interrupts is restored to the state saved at the given flags. This function will never sleep. int hwspin_lock_get_id(struct hwspinlock *hwlock); - retrieve id number of a given hwspinlock. 
    This is needed when an hwspinlock is dynamically assigned: before it
    can be used to achieve mutual exclusion with a remote cpu, the id
    number should be communicated to the remote task with which we want
    to synchronize.
    Returns the hwspinlock id number, or -EINVAL if hwlock is null.

3. Typical usage

#include <linux/hwspinlock.h>
#include <linux/err.h>

int hwspinlock_example1(void)
{
	struct hwspinlock *hwlock;
	int id;
	int ret;

	/* dynamically assign a hwspinlock */
	hwlock = hwspin_lock_request();
	if (!hwlock)
		...

	id = hwspin_lock_get_id(hwlock);
	/* probably need to communicate id to a remote processor now */

	/* take the lock, spin for 1 sec if it's already taken */
	ret = hwspin_lock_timeout(hwlock, 1000);
	if (ret)
		...

	/*
	 * we took the lock, do our thing now, but do NOT sleep
	 */

	/* release the lock */
	hwspin_unlock(hwlock);

	/* free the lock */
	ret = hwspin_lock_free(hwlock);
	if (ret)
		...

	return ret;
}

int hwspinlock_example2(void)
{
	struct hwspinlock *hwlock;
	int ret;

	/*
	 * assign a specific hwspinlock id - this should be called early
	 * by board init code.
	 */
	hwlock = hwspin_lock_request_specific(PREDEFINED_LOCK_ID);
	if (!hwlock)
		...

	/* try to take it, but don't spin on it (0 means we got it) */
	ret = hwspin_trylock(hwlock);
	if (ret) {
		pr_info("lock is already taken\n");
		return -EBUSY;
	}

	/*
	 * we took the lock, do our thing now, but do NOT sleep
	 */

	/* release the lock */
	hwspin_unlock(hwlock);

	/* free the lock */
	ret = hwspin_lock_free(hwlock);
	if (ret)
		...

	return ret;
}

4. API for implementors

  int hwspin_lock_register(struct hwspinlock_device *bank, struct device *dev,
		const struct hwspinlock_ops *ops, int base_id, int num_locks);
   - to be called from the underlying platform-specific implementation, in
     order to register a new hwspinlock device (which is usually a bank of
     numerous locks). Should be called from a process context (this function
     might sleep).
     Returns 0 on success, or appropriate error code on failure.
  int hwspin_lock_unregister(struct hwspinlock_device *bank);
   - to be called from the underlying vendor-specific implementation, in order
     to unregister an hwspinlock device (which is usually a bank of numerous
     locks).
     Should be called from a process context (this function might sleep).
     Returns 0 on success, or an appropriate error code on failure (e.g. if
     the hwspinlock is still in use).

5. Important structs

struct hwspinlock_device is a device which usually contains a bank
of hardware locks. It is registered by the underlying hwspinlock
implementation using the hwspin_lock_register() API.

/**
 * struct hwspinlock_device - a device which usually spans numerous hwspinlocks
 * @dev: underlying device, will be used to invoke runtime PM api
 * @ops: platform-specific hwspinlock handlers
 * @base_id: id index of the first lock in this device
 * @num_locks: number of locks in this device
 * @lock: dynamically allocated array of 'struct hwspinlock'
 */
struct hwspinlock_device {
	struct device *dev;
	const struct hwspinlock_ops *ops;
	int base_id;
	int num_locks;
	struct hwspinlock lock[0];
};

struct hwspinlock_device contains an array of hwspinlock structs, each
of which represents a single hardware lock:

/**
 * struct hwspinlock - this struct represents a single hwspinlock instance
 * @bank: the hwspinlock_device structure which owns this lock
 * @lock: initialized and used by hwspinlock core
 * @priv: private data, owned by the underlying platform-specific hwspinlock drv
 */
struct hwspinlock {
	struct hwspinlock_device *bank;
	spinlock_t lock;
	void *priv;
};

When registering a bank of locks, the hwspinlock driver only needs to
set the priv members of the locks. The rest of the members are set and
initialized by the hwspinlock core itself.

6.
Implementation callbacks

There are three possible callbacks defined in 'struct hwspinlock_ops':

struct hwspinlock_ops {
	int (*trylock)(struct hwspinlock *lock);
	void (*unlock)(struct hwspinlock *lock);
	void (*relax)(struct hwspinlock *lock);
};

The first two callbacks are mandatory:

The ->trylock() callback should make a single attempt to take the lock, and
return 0 on failure and 1 on success. This callback may _not_ sleep.

The ->unlock() callback releases the lock. It always succeeds, and it, too,
may _not_ sleep.

The ->relax() callback is optional. It is called by hwspinlock core while
spinning on a lock, and can be used by the underlying implementation to force
a delay between two successive invocations of ->trylock(). It may _not_ sleep.

Explaining the dreaded "No init found." boot hang message
=========================================================

OK, so you've got this pretty unintuitive message (currently located
in init/main.c) and are wondering what the H*** went wrong.
Some high-level reasons for failure (listed roughly in order of execution)
to load the init binary are:
A) Unable to mount root FS
B) init binary doesn't exist on rootfs
C) broken console device
D) binary exists but dependencies not available
E) binary cannot be loaded

Detailed explanations:
0) Set "debug" kernel parameter (in bootloader config file or CONFIG_CMDLINE)
   to get more detailed kernel messages.
A) make sure you have the correct root FS type
   (and root= kernel parameter points to the correct partition),
   required drivers such as storage hardware (such as SCSI or USB!)
   and filesystem (ext3, jffs2 etc.) are builtin (alternatively as modules,
   to be pre-loaded by an initrd)
C) Possibly a conflict in console= setup --> initial console unavailable.
   E.g. some serial consoles are unreliable due to serial IRQ issues (e.g.
   missing interrupt-based configuration).
   Try using a different console= device or e.g. netconsole= .
D) e.g.
required library dependencies of the init binary such as /lib/ld-linux.so.2 missing or broken. Use readelf -d |grep NEEDED to find out which libraries are required. E) make sure the binary's architecture matches your hardware. E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM hardware. In case you tried loading a non-binary file here (shell script?), you should make sure that the script specifies an interpreter in its shebang header line (#!/...) that is fully working (including its library dependencies). And before tackling scripts, better first test a simple non-script binary such as /bin/sh and confirm its successful execution. To find out more, add code to init/main.c to display kernel_execve()s return values. Please extend this explanation whenever you find new failure causes (after all loading the init binary is a CRITICAL and hard transition step which needs to be made as painless as possible), then submit patch to LKML. Further TODOs: - Implement the various run_init_process() invocations via a struct array which can then store the kernel_execve() result value and on failure log it all by iterating over _all_ results (very important usability fix). - try to make the implementation itself more helpful in general, e.g. by providing additional error messages at affected places. Andreas Mohr Using the initial RAM disk (initrd) =================================== Written 1996,2000 by Werner Almesberger and Hans Lermen initrd provides the capability to load a RAM disk by the boot loader. This RAM disk can then be mounted as the root file system and programs can be run from it. Afterwards, a new root file system can be mounted from a different device. The previous root (from initrd) is then moved to a directory and can be subsequently unmounted. initrd is mainly designed to allow system startup to occur in two phases, where the kernel comes up with a minimum set of compiled-in drivers, and where additional modules are loaded from initrd. 
This document gives a brief overview of the use of initrd. A more detailed discussion of the boot process can be found in [1]. Operation --------- When using initrd, the system typically boots as follows: 1) the boot loader loads the kernel and the initial RAM disk 2) the kernel converts initrd into a "normal" RAM disk and frees the memory used by initrd 3) if the root device is not /dev/ram0, the old (deprecated) change_root procedure is followed. see the "Obsolete root change mechanism" section below. 4) root device is mounted. if it is /dev/ram0, the initrd image is then mounted as root 5) /sbin/init is executed (this can be any valid executable, including shell scripts; it is run with uid 0 and can do basically everything init can do). 6) init mounts the "real" root file system 7) init places the root file system at the root directory using the pivot_root system call 8) init execs the /sbin/init on the new root filesystem, performing the usual boot sequence 9) the initrd file system is removed Note that changing the root directory does not involve unmounting it. It is therefore possible to leave processes running on initrd during that procedure. Also note that file systems mounted under initrd continue to be accessible. Boot command-line options ------------------------- initrd adds the following new options: initrd= (e.g. LOADLIN) Loads the specified file as the initial RAM disk. When using LILO, you have to specify the RAM disk image file in /etc/lilo.conf, using the INITRD configuration variable. noinitrd initrd data is preserved but it is not converted to a RAM disk and the "normal" root file system is mounted. initrd data can be read from /dev/initrd. Note that the data in initrd can have any structure in this case and doesn't necessarily have to be a file system image. This option is used mainly for debugging. Note: /dev/initrd is read-only and it can only be used once. 
As soon as the last process has closed it, all data is freed and /dev/initrd can't be opened anymore. root=/dev/ram0 initrd is mounted as root, and the normal boot procedure is followed, with the RAM disk mounted as root. Compressed cpio images ---------------------- Recent kernels have support for populating a ramdisk from a compressed cpio archive. On such systems, the creation of a ramdisk image doesn't need to involve special block devices or loopbacks; you merely create a directory on disk with the desired initrd content, cd to that directory, and run (as an example): find . | cpio --quiet -H newc -o | gzip -9 -n > /boot/imagefile.img Examining the contents of an existing image file is just as simple: mkdir /tmp/imagefile cd /tmp/imagefile gzip -cd /boot/imagefile.img | cpio -imd --quiet Installation ------------ First, a directory for the initrd file system has to be created on the "normal" root file system, e.g. # mkdir /initrd The name is not relevant. More details can be found on the pivot_root(2) man page. If the root file system is created during the boot procedure (i.e. if you're building an install floppy), the root file system creation procedure should create the /initrd directory. If initrd will not be mounted in some cases, its content is still accessible if the following device has been created: # mknod /dev/initrd b 1 250 # chmod 400 /dev/initrd Second, the kernel has to be compiled with RAM disk support and with support for the initial RAM disk enabled. Also, at least all components needed to execute programs from initrd (e.g. executable format and file system) must be compiled into the kernel. Third, you have to create the RAM disk image. This is done by creating a file system on a block device, copying files to it as needed, and then copying the content of the block device to the initrd file. 
With recent kernels, at least three types of devices are suitable for
that:

 - a floppy disk (works everywhere but it's painfully slow)
 - a RAM disk (fast, but allocates physical memory)
 - a loopback device (the most elegant solution)

We'll describe the loopback device method:

 1) make sure loopback block devices are configured into the kernel
 2) create an empty file system of the appropriate size, e.g.
    # dd if=/dev/zero of=initrd bs=300k count=1
    # mke2fs -F -m0 initrd
    (if space is critical, you may want to use the Minix FS instead of Ext2)
 3) mount the file system, e.g.
    # mount -t ext2 -o loop initrd /mnt
 4) create the console device:
    # mkdir /mnt/dev
    # mknod /mnt/dev/console c 5 1
 5) copy all the files that are needed to properly use the initrd
    environment. Don't forget the most important file, /sbin/init
    Note that /sbin/init's permissions must include "x" (execute).
 6) correct operation of the initrd environment can frequently be tested
    even without rebooting with the command
    # chroot /mnt /sbin/init
    This is of course limited to initrds that do not interfere with the
    general system state (e.g. by reconfiguring network interfaces,
    overwriting mounted devices, trying to start already running daemons,
    etc.). Note however that it is usually possible to use pivot_root in
    such a chroot'ed initrd environment.
 7) unmount the file system
    # umount /mnt
 8) the initrd is now in the file "initrd". Optionally, it can now be
    compressed
    # gzip -9 initrd

For experimenting with initrd, you may want to take a rescue floppy and
only add a symbolic link from /sbin/init to /bin/sh. Alternatively, you
can try the experimental newlib environment [2] to create a small
initrd.

Finally, you have to boot the kernel and load initrd. Almost all Linux
boot loaders support initrd. Since the boot process is still compatible
with an older mechanism, the following boot command line parameters
have to be given:

  root=/dev/ram0 rw

(rw is only necessary if writing to the initrd file system.)
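Before unmounting the image in step 7 of the loopback procedure, a quick check of the mounted tree can catch the mistakes that steps 4 and 5 warn about. The helper below is only a sketch; its name and messages are hypothetical, and it checks nothing beyond the console device and the init program.

```sh
#!/bin/sh
# initrd_check_tree DIR
# Report missing essentials in an initrd tree mounted or unpacked at DIR.
# Prints one line per problem and returns 1 if any problem was found.
initrd_check_tree() {
    status=0
    # Step 4: the console must exist as a character device.
    if [ ! -c "$1/dev/console" ]; then
        echo "missing character device: dev/console"
        status=1
    fi
    init="$1/sbin/init"
    if [ -L "$init" ]; then
        # A symlink (e.g. to /bin/sh) is resolved inside the image at
        # boot time, so its target is not checked from the host here.
        :
    elif [ ! -f "$init" ] || [ ! -x "$init" ]; then
        # Step 5: /sbin/init must exist and carry the "x" permission.
        echo "missing or not executable: sbin/init"
        status=1
    fi
    return $status
}
```

With the image mounted as in step 3, "initrd_check_tree /mnt" prints nothing and returns 0 when both checks pass.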
With LOADLIN, you simply execute

     LOADLIN initrd=
e.g. LOADLIN C:\LINUX\BZIMAGE initrd=C:\LINUX\INITRD.GZ root=/dev/ram0 rw

With LILO, you add the option INITRD= to either the global section
or to the section of the respective kernel in /etc/lilo.conf, and pass
the options using APPEND, e.g.

  image = /bzImage
    initrd = /boot/initrd.gz
    append = "root=/dev/ram0 rw"

and run /sbin/lilo

For other boot loaders, please refer to the respective documentation.

Now you can boot and enjoy using initrd.


Changing the root device
------------------------

When finished with its duties, init typically changes the root device
and proceeds with starting the Linux system on the "real" root device.

The procedure involves the following steps:
 - mounting the new root file system
 - turning it into the root file system
 - removing all accesses to the old (initrd) root file system
 - unmounting the initrd file system and de-allocating the RAM disk

Mounting the new root file system is easy: it just needs to be mounted on
a directory under the current root. Example:

# mkdir /new-root
# mount -o ro /dev/hda1 /new-root

The root change is accomplished with the pivot_root system call, which
is also available via the pivot_root utility (see pivot_root(8) man
page; pivot_root is distributed with util-linux version 2.10h or higher
[3]). pivot_root moves the current root to a directory under the new
root, and puts the new root at its place. The directory for the old root
must exist before calling pivot_root. Example:

# cd /new-root
# mkdir initrd
# pivot_root . initrd

Now, the init process may still access the old root via its
executable, shared libraries, standard input/output/error, and its
current root directory. All these references are dropped by the
following command:

# exec chroot . what-follows <dev/console >dev/console 2>&1

Where what-follows is a program under the new root, e.g.
/sbin/init If the new root file system will be used with udev and has no valid /dev directory, udev must be initialized before invoking chroot in order to provide /dev/console. Note: implementation details of pivot_root may change with time. In order to ensure compatibility, the following points should be observed: - before calling pivot_root, the current directory of the invoking process should point to the new root directory - use . as the first argument, and the _relative_ path of the directory for the old root as the second argument - a chroot program must be available under the old and the new root - chroot to the new root afterwards - use relative paths for dev/console in the exec command Now, the initrd can be unmounted and the memory allocated by the RAM disk can be freed: # umount /initrd # blockdev --flushbufs /dev/ram0 It is also possible to use initrd with an NFS-mounted root, see the pivot_root(8) man page for details. Usage scenarios --------------- The main motivation for implementing initrd was to allow for modular kernel configuration at system installation. The procedure would work as follows: 1) system boots from floppy or other media with a minimal kernel (e.g. support for RAM disks, initrd, a.out, and the Ext2 FS) and loads initrd 2) /sbin/init determines what is needed to (1) mount the "real" root FS (i.e. device type, device drivers, file system) and (2) the distribution media (e.g. CD-ROM, network, tape, ...). This can be done by asking the user, by auto-probing, or by using a hybrid approach. 3) /sbin/init loads the necessary kernel modules 4) /sbin/init creates and populates the root file system (this doesn't have to be a very usable system yet) 5) /sbin/init invokes pivot_root to change the root file system and execs - via chroot - a program that continues the installation 6) the boot loader is installed 7) the boot loader is configured to load an initrd with the set of modules that was used to bring up the system (e.g. 
    /initrd can be modified, then unmounted, and finally, the image is
    written from /dev/ram0 or /dev/rd/0 to a file)
 8) now the system is bootable and additional installation tasks can be
    performed

The key role of initrd here is to re-use the configuration data during
normal system operation without requiring the use of a bloated "generic"
kernel or re-compiling or re-linking the kernel.

A second scenario is for installations where Linux runs on systems with
different hardware configurations in a single administrative domain. In
such cases, it is desirable to generate only a small set of kernels
(ideally only one) and to keep the system-specific part of configuration
information as small as possible. In this case, a common initrd could be
generated with all the necessary modules. Then, only /sbin/init or a file
read by it would have to be different.

A third scenario is more convenient recovery disks, because information
like the location of the root FS partition doesn't have to be provided at
boot time, but the system loaded from initrd can invoke a user-friendly
dialog and it can also perform some sanity checks (or even some form of
auto-detection).

Last but not least, CD-ROM distributors may use it for better installation
from CD, e.g. by using a boot floppy and bootstrapping a bigger RAM disk
via initrd from CD; or by booting via a loader like LOADLIN or directly
from the CD-ROM, and loading the RAM disk from CD without need of
floppies.


Obsolete root change mechanism
------------------------------

The following mechanism was used before the introduction of pivot_root.
Current kernels still support it, but you should _not_ rely on its
continued availability.

It works by mounting the "real" root device (i.e. the one set with rdev
in the kernel image or with root=... at the boot command line) as the
root file system when linuxrc exits.
The initrd file system is then unmounted, or, if it is still busy, moved
to a directory /initrd, if such a directory exists on the new root file
system.

In order to use this mechanism, you do not have to specify the boot
command options root, init, or rw. (If specified, they will affect
the real root file system, not the initrd environment.)

If /proc is mounted, the "real" root device can be changed from within
linuxrc by writing the number of the new root FS device to the special
file /proc/sys/kernel/real-root-dev, e.g.

  # echo 0x301 >/proc/sys/kernel/real-root-dev

Note that the mechanism is incompatible with NFS and similar file
systems.

This old, deprecated mechanism is commonly called "change_root", while
the new, supported mechanism is called "pivot_root".


Mixed change_root and pivot_root mechanism
------------------------------------------

If you do not want to use root=/dev/ram0 to trigger the pivot_root
mechanism, you may create both /linuxrc and /sbin/init in your initrd
image.

/linuxrc would contain only the following:

#! /bin/sh
mount -n -t proc proc /proc
echo 0x0100 >/proc/sys/kernel/real-root-dev
umount -n /proc

Once linuxrc has exited, the kernel mounts your initrd as root again,
this time executing /sbin/init. Again, it is the duty of this init to
build the right environment (maybe using the root= device passed on the
cmdline) before the final execution of the real /sbin/init.
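The numbers written to /proc/sys/kernel/real-root-dev in the two echo
commands above appear to follow the traditional 16-bit device number
layout, with the major number in the high byte and the minor number in
the low byte. The following is only a sketch of that arithmetic under
this assumption; the function name old_style_dev is made up for
illustration and is not a kernel interface.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Illustrative helper (hypothetical name): combine a major and a minor
 * number into an old-style 16-bit device number, assuming the layout
 * (major << 8) | minor with both values in the range 0..255.
 */
unsigned int old_style_dev(unsigned int major, unsigned int minor)
{
	assert(major <= 0xff && minor <= 0xff);
	return (major << 8) | minor;
}

/* Print the number in the hexadecimal form used with echo above. */
void print_real_root_dev(unsigned int major, unsigned int minor)
{
	printf("0x%04x\n", old_style_dev(major, minor));
}
```

Under this assumption, major 3 and minor 1 (traditionally /dev/hda1)
give 0x301, and major 1 and minor 0 (/dev/ram0) give 0x0100, which
matches the values used in the text.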
Resources
---------

[1] Almesberger, Werner; "Booting Linux: The History and the Future"
    http://www.almesberger.net/cv/papers/ols2k-9.ps.gz
[2] newlib package (experimental), with initrd example
    http://sources.redhat.com/newlib/
[3] Brouwer, Andries; "util-linux: Miscellaneous utilities for Linux"
    ftp://ftp.win.tue.nl/pub/linux-local/utils/util-linux/

Intel(R) TXT Overview:
=====================

Intel's technology for safer computing, Intel(R) Trusted Execution
Technology (Intel(R) TXT), defines platform-level enhancements that
provide the building blocks for creating trusted platforms.

Intel TXT was formerly known by the code name LaGrande Technology (LT).

Intel TXT in Brief:
o  Provides dynamic root of trust for measurement (DRTM)
o  Data protection in case of improper shutdown
o  Measurement and verification of launched environment

Intel TXT is part of the vPro(TM) brand and is also available on some
non-vPro systems. It is currently available on desktop systems
based on the Q35, X38, Q45, and Q43 Express chipsets (e.g. Dell
Optiplex 755, HP dc7800, etc.) and mobile systems based on the GM45,
PM45, and GS45 Express chipsets.

For more information, see http://www.intel.com/technology/security/.
This site also has a link to the Intel TXT MLE Developers Manual, which
has been updated for the new released platforms.

Intel TXT has been presented at various events over the past few
years, some of which are:
      LinuxTAG 2008:
          http://www.linuxtag.org/2008/en/conf/events/vp-donnerstag.html
      TRUST2008:
          http://www.trust-conference.eu/downloads/Keynote-Speakers/3_David-Grawrock_The-Front-Door-of-Trusted-Computing.pdf
      IDF, Shanghai:
          http://www.prcidf.com.cn/index_en.html
      IDFs 2006, 2007 (I'm not sure if/where they are online)

Trusted Boot Project Overview:
=============================

Trusted Boot (tboot) is an open source, pre-kernel/VMM module that
uses Intel TXT to perform a measured and verified launch of an OS
kernel/VMM.
It is hosted on SourceForge at http://sourceforge.net/projects/tboot.
The mercurial source repo is available at
http://www.bughost.org/repos.hg/tboot.hg.

Tboot currently supports launching Xen (open source VMM/hypervisor
w/ TXT support since v3.2), and now Linux kernels.


Value Proposition for Linux or "Why should you care?"
=====================================================

While there are many products and technologies that attempt to
measure or protect the integrity of a running kernel, they all
assume the kernel is "good" to begin with. The Integrity
Measurement Architecture (IMA) and Linux Integrity Module interface
are examples of such solutions.

To get trust in the initial kernel without using Intel TXT, a
static root of trust must be used. This bases trust in BIOS
starting at system reset and requires measurement of all code
executed from system reset through the completion of the kernel
boot as well as data objects used by that code. In the case of a
Linux kernel, this means all of BIOS, any option ROMs, the
bootloader and the boot config. In practice, this is a lot of
code/data, much of which is subject to change from boot to boot
(e.g. changing NICs may change option ROMs). Without reference
hashes, these measurement changes are difficult to assess or
confirm as benign. This process also does not provide DMA
protection, memory configuration/alias checks and locks, crash
protection, or policy support.

By using the hardware-based root of trust that Intel TXT provides,
many of these issues can be mitigated. Specifically: many
pre-launch components can be removed from the trust chain, DMA
protection is provided to all launched components, a large number
of platform configuration checks are performed and values locked,
protection is provided for any data in the event of an improper
shutdown, and there is support for policy-based execution/verification.
This provides a more stable measurement and a higher assurance of system configuration and initial state than would be otherwise possible. Since the tboot project is open source, source code for almost all parts of the trust chain is available (excepting SMM and Intel-provided firmware). How Does it Work? ================= o Tboot is an executable that is launched by the bootloader as the "kernel" (the binary the bootloader executes). o It performs all of the work necessary to determine if the platform supports Intel TXT and, if so, executes the GETSEC[SENTER] processor instruction that initiates the dynamic root of trust. - If tboot determines that the system does not support Intel TXT or is not configured correctly (e.g. the SINIT AC Module was incorrect), it will directly launch the kernel with no changes to any state. - Tboot will output various information about its progress to the terminal, serial port, and/or an in-memory log; the output locations can be configured with a command line switch. o The GETSEC[SENTER] instruction will return control to tboot and tboot then verifies certain aspects of the environment (e.g. TPM NV lock, e820 table does not have invalid entries, etc.). o It will wake the APs from the special sleep state the GETSEC[SENTER] instruction had put them in and place them into a wait-for-SIPI state. - Because the processors will not respond to an INIT or SIPI when in the TXT environment, it is necessary to create a small VT-x guest for the APs. When they run in this guest, they will simply wait for the INIT-SIPI-SIPI sequence, which will cause VMEXITs, and then disable VT and jump to the SIPI vector. This approach seemed like a better choice than having to insert special code into the kernel's MP wakeup sequence. o Tboot then applies an (optional) user-defined launch policy to verify the kernel and initrd. - This policy is rooted in TPM NV and is described in the tboot project. 
      The tboot project also contains code for tools to create and
      provision the policy.
   - Policies are completely under user control and if not present
     then any kernel will be launched.
   - Policy action is flexible and can include halting on failures
     or simply logging them and continuing.
o  Tboot adjusts the e820 table provided by the bootloader to reserve
   its own location in memory as well as to reserve certain other
   TXT-related regions.
o  As part of its launch, tboot DMA protects all of RAM (using the
   VT-d PMRs). Thus, the kernel must be booted with 'intel_iommu=on'
   in order to remove this blanket protection and use VT-d's
   page-level protection.
o  Tboot will populate a shared page with some data about itself and
   pass this to the Linux kernel as it transfers control.
   - The location of the shared page is passed via the boot_params
     struct as a physical address.
o  The kernel will look for the tboot shared page address and, if it
   exists, map it.
o  As one of the checks/protections provided by TXT, it makes a copy
   of the VT-d DMARs in a DMA-protected region of memory and verifies
   them for correctness. The VT-d code will detect if the kernel was
   launched with tboot and use this copy instead of the one in the
   ACPI table.
o  At this point, tboot and TXT are out of the picture until a
   shutdown (S<n>)
o  In order to put a system into any of the sleep states after a TXT
   launch, TXT must first be exited. This is to prevent attacks that
   attempt to crash the system to gain control on reboot and steal
   data left in memory.
   - The kernel will perform all of its sleep preparation and
     populate the shared page with the ACPI data needed to put the
     platform in the desired sleep state.
   - Then the kernel jumps into tboot via the vector specified in the
     shared page.
   - Tboot will clean up the environment and disable TXT, then use the
     kernel-provided ACPI information to actually place the platform
     into the desired sleep state.
   - In the case of S3, tboot will also register itself as the resume
     vector.
This is necessary because it must re-establish the measured environment upon resume. Once the TXT environment has been restored, it will restore the TPM PCRs and then transfer control back to the kernel's S3 resume vector. In order to preserve system integrity across S3, the kernel provides tboot with a set of memory ranges (RAM and RESERVED_KERN in the e820 table, but not any memory that BIOS might alter over the S3 transition) that tboot will calculate a MAC (message authentication code) over and then seal with the TPM. On resume and once the measured environment has been re-established, tboot will re-calculate the MAC and verify it against the sealed value. Tboot's policy determines what happens if the verification fails. Note that the c/s 194 of tboot which has the new MAC code supports this. That's pretty much it for TXT support. Configuring the System: ====================== This code works with 32bit, 32bit PAE, and 64bit (x86_64) kernels. In BIOS, the user must enable: TPM, TXT, VT-x, VT-d. Not all BIOSes allow these to be individually enabled/disabled and the screens in which to find them are BIOS-specific. grub.conf needs to be modified as follows: title Linux 2.6.29-tip w/ tboot root (hd0,0) kernel /tboot.gz logging=serial,vga,memory module /vmlinuz-2.6.29-tip intel_iommu=on ro root=LABEL=/ rhgb console=ttyS0,115200 3 module /initrd-2.6.29-tip.img module /Q35_SINIT_17.BIN The kernel option for enabling Intel TXT support is found under the Security top-level menu and is called "Enable Intel(R) Trusted Execution Technology (TXT)". It is marked as EXPERIMENTAL and depends on the generic x86 support (to allow maximum flexibility in kernel build options), since the tboot code will detect whether the platform actually supports Intel TXT and thus whether any of the kernel code is executed. The Q35_SINIT_17.BIN file is what Intel TXT refers to as an Authenticated Code Module. 
It is specific to the chipset in the system and can also be found on
the Trusted Boot site. It is an (unencrypted) module signed by
Intel that is used as part of the DRTM process to verify and
configure the system. It is signed because it operates at a higher
privilege level in the system than any other macrocode and its
correct operation is critical to the establishment of the DRTM. The
process for determining the correct SINIT ACM for a system is
documented in the SINIT-guide.txt file that is on the tboot
SourceForge site under the SINIT ACM downloads.

Linux IOMMU Support
===================

The architecture spec can be obtained from the location below.

http://www.intel.com/technology/virtualization/

This guide gives a quick cheat sheet for some basic understanding.

Some Keywords

DMAR - DMA remapping
DRHD - DMA Engine Reporting Structure
RMRR - Reserved memory Region Reporting Structure
ZLR  - Zero length reads from PCI devices
IOVA - IO Virtual address.

Basic stuff
-----------

ACPI enumerates and lists the different DMA engines in the platform, and
device scope relationships between PCI devices and which DMA engine
controls them.

What is RMRR?
-------------

There are some devices the BIOS controls, e.g. USB devices to perform
PS2 emulation. The regions of memory used for these devices are marked
reserved in the e820 map. When we turn on DMA translation, DMA to those
regions will fail. Hence BIOS uses RMRR to specify these regions along
with the devices that need to access them. The OS is expected to set up
unity mappings for these regions so that these devices can access them.

How is IOVA generated?
----------------------

Well-behaved drivers call pci_map_*() before sending a command to a
device that needs to perform DMA. Once DMA is completed and the mapping
is no longer required, the driver calls pci_unmap_*() to unmap the
region.

The Intel IOMMU driver allocates a virtual address per domain. Each PCIE
device has its own domain (hence protection).
Devices under p2p bridges share the virtual address with all devices under the p2p bridge due to transaction id aliasing for p2p bridges. IOVA generation is pretty generic. We used the same technique as vmalloc() but these are not global address spaces, but separate for each domain. Different DMA engines may support different number of domains. We also allocate guard pages with each mapping, so we can attempt to catch any overflow that might happen. Graphics Problems? ------------------ If you encounter issues with graphics devices, you can try adding option intel_iommu=igfx_off to turn off the integrated graphics engine. If this fixes anything, please ensure you file a bug reporting the problem. Some exceptions to IOVA ----------------------- Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff). The same is true for peer to peer transactions. Hence we reserve the address from PCI MMIO ranges so they are not allocated for IOVA addresses. Fault reporting --------------- When errors are reported, the DMA engine signals via an interrupt. The fault reason and device that caused it with fault reason is printed on console. See below for sample. Boot Message Sample ------------------- Something like this gets printed indicating presence of DMAR tables in ACPI. ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0 When DMAR is being processed and initialized by ACPI, prints DMAR locations and any RMRR's processed. ACPI DMAR:Host address width 36 ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000 ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000 ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000 ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff When DMAR is enabled for use, you will notice.. 
PCI-DMA: Using DMAR IOMMU Fault reporting --------------- DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 DMAR:[fault reason 05] PTE Write access is not set DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 DMAR:[fault reason 05] PTE Write access is not set TBD ---- - For compatibility testing, could use unity map domain for all devices, just provide a 1-1 for all useful memory under a single domain for all devices. - API for paravirt ops for abstracting functionality for VMM folks. On some platforms, so-called memory-mapped I/O is weakly ordered. On such platforms, driver writers are responsible for ensuring that I/O writes to memory-mapped addresses on their device arrive in the order intended. This is typically done by reading a 'safe' device or bridge register, causing the I/O chipset to flush pending writes to the device before any reads are posted. A driver would usually use this technique immediately prior to the exit of a critical section of code protected by spinlocks. This would ensure that subsequent writes to I/O space arrived only after all prior writes (much like a memory barrier op, mb(), only with respect to I/O). A more concrete example from a hypothetical device driver: ... CPU A: spin_lock_irqsave(&dev_lock, flags) CPU A: val = readl(my_status); CPU A: ... CPU A: writel(newval, ring_ptr); CPU A: spin_unlock_irqrestore(&dev_lock, flags) ... CPU B: spin_lock_irqsave(&dev_lock, flags) CPU B: val = readl(my_status); CPU B: ... CPU B: writel(newval2, ring_ptr); CPU B: spin_unlock_irqrestore(&dev_lock, flags) ... In the case above, the device may receive newval2 before it receives newval, which could cause problems. Fixing it is easy enough though: ... CPU A: spin_lock_irqsave(&dev_lock, flags) CPU A: val = readl(my_status); CPU A: ... CPU A: writel(newval, ring_ptr); CPU A: (void)readl(safe_register); /* maybe a config register? */ CPU A: spin_unlock_irqrestore(&dev_lock, flags) ... 
        CPU B:  spin_lock_irqsave(&dev_lock, flags)
        CPU B:  val = readl(my_status);
        CPU B:  ...
        CPU B:  writel(newval2, ring_ptr);
        CPU B:  (void)readl(safe_register); /* maybe a config register? */
        CPU B:  spin_unlock_irqrestore(&dev_lock, flags)

Here, the reads from safe_register will cause the I/O chipset to flush any
pending writes before actually posting the read to the chipset, preventing
possible data corruption.

The io_mapping functions in linux/io-mapping.h provide an abstraction for
efficiently mapping small regions of an I/O device to the CPU. The initial
usage is to support the large graphics aperture on 32-bit processors where
ioremap_wc cannot be used to statically map the entire aperture to the CPU
as it would consume too much of the kernel address space.

A mapping object is created during driver initialization using

        struct io_mapping *io_mapping_create_wc(unsigned long base,
                                                unsigned long size)

                'base' is the bus address of the region to be made
                mappable, while 'size' indicates how large a mapping
                region to enable. Both are in bytes.

                This _wc variant provides a mapping which may only be
                used with io_mapping_map_atomic_wc or io_mapping_map_wc.

With this mapping object, individual pages can be mapped either atomically
or not, depending on the necessary scheduling environment. Of course,
atomic maps are more efficient:

        void *io_mapping_map_atomic_wc(struct io_mapping *mapping,
                                       unsigned long offset)

                'offset' is the offset within the defined mapping region.
                Accessing addresses beyond the region specified in the
                creation function yields undefined results. Using an
                offset which is not page aligned yields an undefined
                result. The return value points to a single page in CPU
                address space.

                This _wc variant returns a write-combining map to the
                page and may only be used with mappings created by
                io_mapping_create_wc.

                Note that the task may not sleep while holding this page
                mapped.
        void io_mapping_unmap_atomic(void *vaddr)

                'vaddr' must be the value returned by the last
                io_mapping_map_atomic_wc call. This unmaps the specified
                page and allows the task to sleep once again.

If you need to sleep while holding the lock, you can use the non-atomic
variant, although it may be significantly slower.

        void *io_mapping_map_wc(struct io_mapping *mapping,
                                unsigned long offset)

                This works like io_mapping_map_atomic_wc except it allows
                the task to sleep while holding the page mapped.

        void io_mapping_unmap(void *vaddr)

                This works like io_mapping_unmap_atomic, except it is used
                for pages mapped with io_mapping_map_wc.

At driver close time, the io_mapping object must be freed:

        void io_mapping_free(struct io_mapping *mapping)

Current Implementation:

The initial implementation of these functions uses existing mapping
mechanisms and so provides only an abstraction layer and no new
functionality.

On 64-bit processors, io_mapping_create_wc calls ioremap_wc for the whole
range, creating a permanent kernel-visible mapping to the resource. The
map_atomic and map functions add the requested offset to the base of the
virtual address returned by ioremap_wc.

On 32-bit processors with HIGHMEM defined, io_mapping_map_atomic_wc uses
kmap_atomic_pfn to map the specified page in an atomic fashion;
kmap_atomic_pfn isn't really supposed to be used with device pages, but it
provides an efficient mapping for this usage.

On 32-bit processors without HIGHMEM defined, io_mapping_map_atomic_wc and
io_mapping_map_wc both use ioremap_wc, a terribly inefficient function which
performs an IPI to inform all processors about the new mapping. This results
in a significant performance penalty.

I/O statistics fields
---------------------

Since 2.4.20 (and some versions before, with patches), and 2.5.45,
more extensive disk statistics have been introduced to help measure disk
activity.
Tools such as sar and iostat typically interpret these and do
the work for you, but in case you are interested in creating your own
tools, the fields are explained here.

In 2.4 now, the information is found as additional fields in
/proc/partitions. In 2.6, the same information is found in two
places: one is in the file /proc/diskstats, and the other is within
the sysfs file system, which must be mounted in order to obtain
the information. Throughout this document we'll assume that sysfs
is mounted on /sys, although of course it may be mounted anywhere.
Both /proc/diskstats and sysfs use the same source for the information
and so should not differ.

Here are examples of these different formats:

2.4:
   3     0   39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
   3     1    9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030

2.6 sysfs:
   446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
   35486    38030    38030    38030

2.6 diskstats:
   3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
   3    1   hda1 35486 38030 38030 38030

On 2.4 you might execute "grep 'hda ' /proc/partitions". On 2.6, you have
a choice of "cat /sys/block/hda/stat" or "grep 'hda ' /proc/diskstats".
The advantage of one over the other is that the sysfs choice works well
if you are watching a known, small set of disks. /proc/diskstats may
be a better choice if you are watching a large number of disks because
you'll avoid the overhead of 50, 100, or 500 or more opens/closes with
each snapshot of your disk statistics.

In 2.4, the statistics fields are those after the device name. In
the above example, the first field of statistics would be 446216.
By contrast, in 2.6 if you look at /sys/block/hda/stat, you'll
find just the eleven fields, beginning with 446216. If you look at
/proc/diskstats, the eleven fields will be preceded by the major and
minor device numbers, and device name.
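For readers writing their own tools, the /proc/diskstats layout
described above (major, minor, device name, then the statistics fields)
can be split with sscanf. This is only a sketch: the structure, the
function name parse_diskstats_line and the 32-byte name buffer are
assumptions made for illustration, and any fields beyond the eleven
described here are simply ignored.

```c
#include <assert.h>
#include <stdio.h>

#define DISKSTATS_MAX_FIELDS 11

/* One parsed line in the /proc/diskstats layout (hypothetical type). */
struct diskstats_line {
	unsigned int major;
	unsigned int minor;
	char name[32];                        /* assumed upper bound */
	unsigned long long field[DISKSTATS_MAX_FIELDS]; /* field[0] is field 1 */
};

/*
 * Parse one line. Returns the number of statistics fields found
 * (11 for the whole-disk line and 4 for the 2.6 partition line in the
 * examples above), or -1 if the major/minor/name prefix is missing.
 */
int parse_diskstats_line(const char *line, struct diskstats_line *out)
{
	int n;

	assert(line != NULL && out != NULL);

	n = sscanf(line,
		   "%u %u %31s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &out->major, &out->minor, out->name,
		   &out->field[0], &out->field[1], &out->field[2],
		   &out->field[3], &out->field[4], &out->field[5],
		   &out->field[6], &out->field[7], &out->field[8],
		   &out->field[9], &out->field[10]);
	if (n < 3)
		return -1;
	return n - 3;
}
```

A caller would typically read /proc/diskstats line by line with fgets
and pass each line to this function, using the return value to tell a
full eleven-field line from a four-field partition line.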
Each of these formats provides eleven fields of statistics, each meaning exactly the same things. All fields except field 9 are cumulative since boot. Field 9 should go to zero as I/Os complete; all others only increase (unless they overflow and wrap). Yes, these are (32-bit or 64-bit) unsigned long (native word size) numbers, and on a very busy or long-lived system they may wrap. Applications should be prepared to deal with that; unless your observations are measured in large numbers of minutes or hours, they should not wrap twice before you notice them. Each set of stats only applies to the indicated device; if you want system-wide stats you'll have to find all the devices and sum them all up. Field 1 -- # of reads completed This is the total number of reads completed successfully. Field 2 -- # of reads merged, field 6 -- # of writes merged Reads and writes which are adjacent to each other may be merged for efficiency. Thus two 4K reads may become one 8K read before it is ultimately handed to the disk, and so it will be counted (and queued) as only one I/O. This field lets you know how often this was done. Field 3 -- # of sectors read This is the total number of sectors read successfully. Field 4 -- # of milliseconds spent reading This is the total number of milliseconds spent by all reads (as measured from __make_request() to end_that_request_last()). Field 5 -- # of writes completed This is the total number of writes completed successfully. Field 7 -- # of sectors written This is the total number of sectors written successfully. Field 8 -- # of milliseconds spent writing This is the total number of milliseconds spent by all writes (as measured from __make_request() to end_that_request_last()). Field 9 -- # of I/Os currently in progress The only field that should go to zero. Incremented as requests are given to appropriate struct request_queue and decremented as they finish. 
Field 10 -- # of milliseconds spent doing I/Os This field increases so long as field 9 is nonzero. Field 11 -- weighted # of milliseconds spent doing I/Os This field is incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/Os in progress (field 9) times the number of milliseconds spent doing I/O since the last update of this field. This can provide an easy measure of both I/O completion time and the backlog that may be accumulating. To avoid introducing performance bottlenecks, no locks are held while modifying these counters. This implies that minor inaccuracies may be introduced when changes collide, so (for instance) adding up all the read I/Os issued per partition should equal those made to the disks ... but due to the lack of locking it may only be very close. In 2.6, there are counters for each CPU, which make the lack of locking almost a non-issue. When the statistics are read, the per-CPU counters are summed (possibly overflowing the unsigned long variable they are summed to) and the result given to the user. There is no convenient user interface for accessing the per-CPU counters themselves. Disks vs Partitions ------------------- There were significant changes between 2.4 and 2.6 in the I/O subsystem. As a result, some statistic information disappeared. The translation from a disk address relative to a partition to the disk address relative to the host disk happens much earlier. All merges and timings now happen at the disk level rather than at both the disk and partition level as in 2.4. Consequently, you'll see a different statistics output on 2.6 for partitions from that for disks. There are only *four* fields available for partitions on 2.6 machines. This is reflected in the examples above. Field 1 -- # of reads issued This is the total number of reads issued to this partition. Field 2 -- # of sectors read This is the total number of sectors requested to be read from this partition. 
Field 3 -- # of writes issued This is the total number of writes issued to this partition. Field 4 -- # of sectors written This is the total number of sectors requested to be written to this partition. Note that since the address is translated to a disk-relative one, and no record of the partition-relative address is kept, the subsequent success or failure of the read cannot be attributed to the partition. In other words, the number of reads for partitions is counted slightly before time of queuing for partitions, and at completion for whole disks. This is a subtle distinction that is probably uninteresting for most cases. More significant is the error induced by counting the numbers of reads/writes before merges for partitions and after for disks. Since a typical workload usually contains a lot of successive and adjacent requests, the number of reads/writes issued can be several times higher than the number of reads/writes completed. In 2.6.25, the full statistic set is again available for partitions and disk and partition statistics are consistent again. Since we still don't keep record of the partition-relative address, an operation is attributed to the partition which contains the first sector of the request after the eventual merges. As requests can be merged across partition, this could lead to some (probably insignificant) inaccuracy. Additional notes ---------------- In 2.6, sysfs is not mounted by default. If your distribution of Linux hasn't added it already, here's the line you'll want to add to your /etc/fstab: none /sys sysfs defaults 0 0 In 2.6, all disk statistics were removed from /proc/stat. In 2.4, they appear in both /proc/partitions and /proc/stat, although the ones in /proc/stat take a very different format from those in /proc/partitions (see proc(5), if your system has it.) 
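As noted in the description of the statistics fields, the counters are
unsigned long values that may wrap, and applications should be prepared
for that. One way to handle a single wrap between two samples is to rely
on unsigned subtraction, sketched below. The function names are made up
for illustration, and the sketch assumes that unsigned long in the
application is as wide as the kernel's counters. The second helper shows
one possible use of field 10 (milliseconds spent doing I/Os).

```c
#include <assert.h>

/*
 * Difference between two samples of a cumulative counter. Unsigned
 * arithmetic is performed modulo 2^N, so the result is still correct
 * if the counter wrapped once between the two samples.
 */
unsigned long counter_delta(unsigned long prev, unsigned long cur)
{
	return cur - prev;
}

/*
 * Percentage of the sampling interval during which I/O was in
 * progress, computed from two samples of field 10. interval_ms is the
 * wall-clock time between the two samples, measured by the caller.
 */
double busy_percent(unsigned long prev_f10, unsigned long cur_f10,
		    unsigned long interval_ms)
{
	assert(interval_ms > 0);
	return 100.0 * (double)counter_delta(prev_f10, cur_f10) /
	       (double)interval_ms;
}
```

If the counter wraps more than once between samples, the difference is
no longer meaningful, which is why the sampling interval should stay
short compared to the time a counter needs to wrap.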
-- ricklind@us.ibm.com The Linux IPMI Driver --------------------- Corey Minyard The Intelligent Platform Management Interface, or IPMI, is a standard for controlling intelligent devices that monitor a system. It provides for dynamic discovery of sensors in the system and the ability to monitor the sensors and be informed when the sensor's values change or go outside certain boundaries. It also has a standardized database for field-replaceable units (FRUs) and a watchdog timer. To use this, you need an interface to an IPMI controller in your system (called a Baseboard Management Controller, or BMC) and management software that can use the IPMI system. This document describes how to use the IPMI driver for Linux. If you are not familiar with IPMI itself, see the web site at http://www.intel.com/design/servers/ipmi/index.htm. IPMI is a big subject and I can't cover it all here! Configuration ------------- The Linux IPMI driver is modular, which means you have to pick several things to have it work right depending on your hardware. Most of these are available in the 'Character Devices' menu then the IPMI menu. No matter what, you must pick 'IPMI top-level message handler' to use IPMI. What you do beyond that depends on your needs and hardware. The message handler does not provide any user-level interfaces. Kernel code (like the watchdog) can still use it. If you need access from userland, you need to select 'Device interface for IPMI' if you want access through a device driver. The driver interface depends on your hardware. If your system properly provides the SMBIOS info for IPMI, the driver will detect it and just work. If you have a board with a standard interface (These will generally be either "KCS", "SMIC", or "BT", consult your hardware manual), choose the 'IPMI SI handler' option. A driver also exists for direct I2C access to the IPMI management controller. Some boards support this, but it is unknown if it will work on every board. 
For this, choose 'IPMI SMBus handler', but be ready to try to do some
figuring to see if it will work on your system if the SMBIOS/ACPI
information is wrong or not present. It is fairly safe to have both
these enabled and let the drivers auto-detect what is present.

You should generally enable ACPI on your system, as systems with IPMI
can have ACPI tables describing them.

If you have a standard interface and the board manufacturer has done
their job correctly, the IPMI controller should be automatically
detected (via ACPI or SMBIOS tables) and should just work. Sadly,
many boards do not have this information. The driver attempts
standard defaults, but they may not work. If you fall into this
situation, you need to read the section below named 'The SI Driver' or
'The SMBus Driver' on how to hand-configure your system.

IPMI defines a standard watchdog timer. You can enable this with the
'IPMI Watchdog Timer' config option. If you compile the driver into
the kernel, then via a kernel command-line option you can have the
watchdog timer start as soon as it initializes. It also has many
other options; see the 'Watchdog' section below for more details.
Note that you can also have the watchdog continue to run if it is
closed (by default it is disabled on close). Go into the 'Watchdog
Cards' menu, enable 'Watchdog Timer Support', and enable the option
'Disable watchdog shutdown on close'.

IPMI systems can often be powered off using IPMI commands. Select
'IPMI Poweroff' to do this. The driver will auto-detect if the system
can be powered off by IPMI. It is safe to enable this even if your
system doesn't support this option. This works on ATCA systems, the
Radisys CPI1 card, and any IPMI system that supports standard chassis
management commands.

If you want the driver to put an event into the event log on a panic,
enable the 'Generate a panic event to all BMCs on a panic' option.
If you want the whole panic string put into the event log using OEM
events, enable the 'Generate OEM events containing the panic string'
option.

Basic Design
------------

The Linux IPMI driver is designed to be very modular and flexible: you
only need to take the pieces you need and you can use it in many
different ways.  Because of that, it's broken into many chunks of
code.  These chunks (by module name) are:

ipmi_msghandler - This is the central piece of software for the IPMI
system.  It handles all messages, message timing, and responses.  The
IPMI users tie into this, and the IPMI physical interfaces (called
System Management Interfaces, or SMIs) also tie in here.  This
provides the kernelland interface for IPMI, but does not provide an
interface for use by application processes.

ipmi_devintf - This provides a userland IOCTL interface for the IPMI
driver; each open file for this device ties in to the message handler
as an IPMI user.

ipmi_si - A driver for various system interfaces.  This supports KCS,
SMIC, and BT interfaces.  Unless you have an SMBus interface or your
own custom interface, you probably need to use this.

ipmi_smb - A driver for accessing BMCs on the SMBus.  It uses the
I2C kernel driver's SMBus interfaces to send and receive IPMI messages
over the SMBus.

ipmi_watchdog - IPMI requires systems to have a very capable watchdog
timer.  This driver implements the standard Linux watchdog timer
interface on top of the IPMI message handler.

ipmi_poweroff - Some systems support the ability to be turned off via
IPMI commands.

These are all individually selectable via configuration options.

Note that the KCS-only interface has been removed.  The af_ipmi driver
is no longer supported and has been removed because it was impossible
to do 32 bit emulation on 64-bit kernels with it.

Much documentation for the interface is in the include files.  The
IPMI include files are:

net/af_ipmi.h - Contains the socket interface.
linux/ipmi.h - Contains the user interface and IOCTL interface for IPMI.

linux/ipmi_smi.h - Contains the interface for system management interfaces
(things that interface to IPMI controllers) to use.

linux/ipmi_msgdefs.h - General definitions for base IPMI messaging.


Addressing
----------

The IPMI addressing works much like IP addresses: you have an overlay
to handle the different address types.  The overlay is:

  struct ipmi_addr
  {
	int   addr_type;
	short channel;
	char  data[IPMI_MAX_ADDR_SIZE];
  };

The addr_type determines what the address really is.  The driver
currently understands two different types of addresses.

"System Interface" addresses are defined as:

  struct ipmi_system_interface_addr
  {
	int   addr_type;
	short channel;
  };

and the type is IPMI_SYSTEM_INTERFACE_ADDR_TYPE.  This is used for talking
straight to the BMC on the current card.  The channel must be
IPMI_BMC_CHANNEL.

Messages that are destined to go out on the IPMB bus use the
IPMI_IPMB_ADDR_TYPE address type.  The format is

  struct ipmi_ipmb_addr
  {
	int           addr_type;
	short         channel;
	unsigned char slave_addr;
	unsigned char lun;
  };

The "channel" here is generally zero, but some devices support more
than one channel; it corresponds to the channel as defined in the IPMI
spec.


Messages
--------

Messages are defined as:

  struct ipmi_msg
  {
	unsigned char netfn;
	unsigned char lun;
	unsigned char cmd;
	unsigned char *data;
	int           data_len;
  };

The driver takes care of adding/stripping the header information.  The
data portion is just the data to be sent (do NOT put addressing info
here) or the response.  Note that the completion code of a response is
the first item in "data"; it is not stripped out because that is how
all the messages are defined in the spec (and thus makes counting the
offsets a little easier :-).

When using the IOCTL interface from userland, you must provide a block
of data for "data", fill it, and set data_len to the length of the
block of data, even when receiving messages.
Otherwise the driver will have no place to put the message.

Messages coming up from the message handler in kernelland will come in
as:

  struct ipmi_recv_msg
  {
	struct list_head link;

	/* The type of message as defined in the "Receive Types"
           defines above. */
	int         recv_type;

	ipmi_user_t      *user;
	struct ipmi_addr addr;
	long             msgid;
	struct ipmi_msg  msg;

	/* Call this when done with the message.  It will presumably free
	   the message and do any other necessary cleanup. */
	void (*done)(struct ipmi_recv_msg *msg);

	/* Place-holder for the data, don't make any assumptions about
	   the size or existence of this, since it may change. */
	unsigned char   msg_data[IPMI_MAX_MSG_LENGTH];
  };

You should look at the receive type and handle the message
appropriately.


The Upper Layer Interface (Message Handler)
-------------------------------------------

The upper layer of the interface provides the users with a consistent
view of the IPMI interfaces.  It allows multiple SMI interfaces to be
addressed (because some boards actually have multiple BMCs on them)
and the user should not have to care what type of SMI is below them.


Creating the User

To use the message handler, you must first create a user using
ipmi_create_user.  The interface number specifies which SMI you want
to connect to, and you must supply callback functions to be called
when data comes in.  The callback function can run at interrupt level,
so be careful using the callbacks.  This also allows you to pass in a
piece of data, the handler_data, that will be passed back to you on
all calls.

Once you are done, call ipmi_destroy_user() to get rid of the user.

From userland, opening the device automatically creates a user, and
closing the device automatically destroys the user.


Messaging

To send a message from kernel-land, the ipmi_request() call does
pretty much all message handling.  Most of the parameters are
self-explanatory.  However, it takes a "msgid" parameter.  This is NOT
the sequence number of messages.
It is simply a long value that is passed back when the response for
the message is returned.  You may use it for anything you like.

Responses come back in the function pointed to by the ipmi_recv_hndl
field of the "handler" that you passed in to ipmi_create_user().
Remember again, these may be running at interrupt level.  Remember to
look at the receive type, too.

From userland, you fill out an ipmi_req_t structure and use the
IPMICTL_SEND_COMMAND ioctl.  For incoming stuff, you can use select()
or poll() to wait for messages to come in.  However, you cannot use
read() to get them; you must call the IPMICTL_RECEIVE_MSG ioctl with
the ipmi_recv_t structure to actually get the message.  Remember that
you must supply a pointer to a block of data in the msg.data field,
and you must fill in the msg.data_len field with the size of the data.
This gives the receiver a place to actually put the message.

If the message cannot fit into the data you provide, you will get an
EMSGSIZE error and the driver will leave the data in the receive
queue.  If you want to get it and have it truncate the message, use
the IPMICTL_RECEIVE_MSG_TRUNC ioctl.

When you send a command (which is defined by the lowest-order bit of
the netfn per the IPMI spec) on the IPMB bus, the driver will
automatically assign the sequence number to the command and save the
command.  If the response is not received in the IPMI-specified 5
seconds, it will generate a response automatically saying the command
timed out.  If an unsolicited response comes in (if it was after 5
seconds, for instance), that response will be ignored.

In kernelland, after you receive a message and are done with it, you
MUST call ipmi_free_recv_msg() on it, or you will leak messages.  Note
that you should NEVER mess with the "done" field of a message; that is
required to properly clean up the message.

Note that when sending, there is an ipmi_request_supply_msgs() call
that lets you supply the smi and receive message.
This is useful for pieces of code that need to work even if the system
is out of buffers (the watchdog timer uses this, for instance).  You
supply your own buffer and own free routines.  This is not recommended
for normal use, though, since it is tricky to manage your own buffers.


Events and Incoming Commands

The driver takes care of polling for IPMI events and receiving
commands (commands are messages that are not responses; they are
commands that other things on the IPMB bus have sent you).  To receive
these, you must register for them; they will not automatically be sent
to you.

To receive events, you must call ipmi_set_gets_events() and set the
"val" to non-zero.  Any events that have been received by the driver
since startup will immediately be delivered to the first user that
registers for events.  After that, if multiple users are registered
for events, they will all receive all events that come in.

For receiving commands, you have to individually register commands you
want to receive.  Call ipmi_register_for_cmd() and supply the netfn
and command name for each command you want to receive.  You also
specify a bitmask of the channels you want to receive the command from
(or use IPMI_CHAN_ALL for all channels if you don't care).  Only one
user may be registered for each netfn/cmd/channel, but different users
may register for different commands, or the same command if the
channel bitmasks do not overlap.

From userland, equivalent IOCTLs are provided to do these functions.


The Lower Layer (SMI) Interface
-------------------------------

As mentioned before, multiple SMI interfaces may be registered to the
message handler; each of these is assigned an interface number when
they register with the message handler.  They are generally assigned
in the order they register, although if an SMI unregisters and then
another one registers, all bets are off.

The ipmi_smi.h defines the interface for management interfaces; see
that for more details.
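
The command registration rule described under "Events and Incoming
Commands" (one user per netfn/cmd/channel, with the same command
allowed for several users as long as their channel bitmasks do not
overlap) can be sketched as a small standalone check.  This is only an
illustration of the rule, not the driver's code: struct cmd_reg and
regs_conflict() are hypothetical names, and the sketch assumes that bit
N of the mask stands for channel N.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one command registration (not a driver type). */
struct cmd_reg {
	unsigned char netfn;
	unsigned char cmd;
	unsigned int  chans;	/* assumed: bit N set means channel N */
};

/*
 * Two registrations conflict when they name the same netfn and cmd
 * and their channel bitmasks share at least one channel.
 */
static bool regs_conflict(const struct cmd_reg *a, const struct cmd_reg *b)
{
	assert(a != NULL && b != NULL);
	return a->netfn == b->netfn && a->cmd == b->cmd &&
	       (a->chans & b->chans) != 0;
}
```

With this check, two users asking for the same netfn/cmd on channel
masks 0x1 and 0x2 can coexist, while masks 0x1 and 0x3 collide on
channel 0.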


The SI Driver
-------------

The SI driver allows up to 4 KCS or SMIC interfaces to be configured
in the system.  By default, the driver scans the ACPI tables for
interfaces, and if it doesn't find any it will attempt to register one
KCS interface at the spec-specified I/O port 0xca2 without interrupts.
You can change this at module load time (for a module) with:

  modprobe ipmi_si.o type=,....
       ports=,... addrs=,...
       irqs=,... trydefaults=[0|1]
       regspacings=,,... regsizes=,,...
       regshifts=,,...
       slave_addrs=,,...
       force_kipmid=,,...
       kipmid_max_busy_us=,,...
       unload_when_empty=[0|1]

Each of these except si_trydefaults is a list, the first item for the
first interface, second item for the second interface, etc.

The si_type may be either "kcs", "smic", or "bt".  If you leave it blank, it
defaults to "kcs".

If you specify si_addrs as non-zero for an interface, the driver will
use the memory address given as the address of the device.  This
overrides si_ports.

If you specify si_ports as non-zero for an interface, the driver will
use the I/O port given as the device address.

If you specify si_irqs as non-zero for an interface, the driver will
attempt to use the given interrupt for the device.

si_trydefaults sets whether the standard IPMI interface at 0xca2 and
any interfaces specified by ACPI are tried.  By default, the driver
tries them; set this value to zero to turn this off.

The next three parameters have to do with register layout.  The
registers used by the interfaces may not appear at successive
locations and they may not be in 8-bit registers.  These parameters
allow the layout of the data in the registers to be more precisely
specified.

The regspacings parameter gives the number of bytes between successive
register start addresses.  For instance, if the regspacing is set to 4
and the start address is 0xca2, then the address for the second
register would be 0xca6.  This defaults to 1.

The regsizes parameter gives the size of a register, in bytes.
The data used by IPMI is 8 bits wide, but it may be inside a larger
register.  This parameter allows the read and write type to be
specified.  It may be 1, 2, 4, or 8.  The default is 1.

Since the register size may be larger than 8 bits, the IPMI data may
not be in the lower 8 bits.  The regshifts parameter gives the amount
to shift the data to get to the actual IPMI data.

The slave_addrs specifies the IPMI address of the local BMC.  This is
usually 0x20 and the driver defaults to that, but in case it's not, it
can be specified when the driver starts up.

The force_kipmid parameter forcefully enables (if set to 1) or
disables (if set to 0) the kernel IPMI daemon.  Normally this is
auto-detected by the driver, but systems with broken interrupts might
need an enable, or users that don't want the daemon (don't need the
performance, don't want the CPU hit) can disable it.

If unload_when_empty is set to 1, the driver will be unloaded if it
doesn't find any interfaces or all the interfaces fail to work.  The
default is one.  Setting to 0 is useful with the hotmod, but is
obviously only useful for modules.

When compiled into the kernel, the parameters can be specified on the
kernel command line as:

  ipmi_si.type=,...
       ipmi_si.ports=,... ipmi_si.addrs=,...
       ipmi_si.irqs=,... ipmi_si.trydefaults=[0|1]
       ipmi_si.regspacings=,,...
       ipmi_si.regsizes=,,...
       ipmi_si.regshifts=,,...
       ipmi_si.slave_addrs=,,...
       ipmi_si.force_kipmid=,,...
       ipmi_si.kipmid_max_busy_us=,,...

It works the same as the module parameters of the same names.

By default, the driver will attempt to detect any device specified by
ACPI, and if none of those then a KCS device at the spec-specified
0xca2.  If you want to turn this off, set the "trydefaults" option to
false.

If your IPMI interface does not support interrupts and is a KCS or
SMIC interface, the IPMI driver will start a kernel thread for the
interface to help speed things up.
This is a low-priority kernel thread that constantly polls the IPMI
driver while an IPMI operation is in progress.  The force_kipmid
module parameter will allow the user to force this thread on or off.
If you force it off and don't have interrupts, the driver will run
VERY slowly.  Don't blame me, these interfaces suck.

Unfortunately, this thread can use a lot of CPU depending on the
interface's performance.  This can waste a lot of CPU and cause
various issues with detecting idle CPU and using extra power.  To
avoid this, the kipmid_max_busy_us parameter sets the maximum amount
of time, in microseconds, that kipmid will spin before sleeping for a
tick.  This value sets a balance between performance and CPU waste and
needs to be tuned to your needs.  Maybe, someday, auto-tuning will be
added, but that's not a simple thing and even the auto-tuning would
need to be tuned to the user's desired performance.

The driver supports hot add and remove of interfaces.  This way,
interfaces can be added or removed after the kernel is up and running.
This is done using /sys/module/ipmi_si/parameters/hotmod, which is a
write-only parameter.  You write a string to this interface.  The
string has the format:

   [:op2[:op3...]]

The "op"s are:

   add|remove,kcs|bt|smic,mem|i/o,[,[,[,...]]]

You can specify more than one interface on the line.  The "opt"s are:

   rsp=
   rsi=
   rsh=
   irq=
   ipmb=

and these have the same meanings as discussed above.  Note that you
can also use this on the kernel command line for a more compact format
for specifying an interface.  Note that when removing an interface,
only the first three parameters (si type, address type, and address)
are used for the comparison.  Any options are ignored for removing.


The SMBus Driver
----------------

The SMBus driver allows up to 4 SMBus devices to be configured in the
system.  By default, the driver will register any SMBus interfaces it
finds in the I2C address range of 0x20 to 0x4f on any adapter.  You
can change this at module load time (for a module) with:

  modprobe ipmi_smb.o
	addr=,[,,[,...]]
	dbg=,...
	[defaultprobe=1] [dbg_probe=1]

The addresses are specified in pairs: the first is the adapter ID and
the second is the I2C address on that adapter.

The debug flags are bit flags for each BMC found, they are:
IPMI messages: 1, driver state: 2, timing: 4, I2C probe: 8

Setting smb_defaultprobe to zero disables the default probing of SMBus
interfaces at address range 0x20 to 0x4f.  This means that only the
BMCs specified on the smb_addr line will be detected.

Setting smb_dbg_probe to 1 will enable debugging of the probing and
detection process for BMCs on the SMBusses.

Discovering the IPMI compliant BMC on the SMBus can cause devices on
the I2C bus to fail.  The SMBus driver writes a "Get Device ID" IPMI
message as a block write to the I2C bus and waits for a response.
This action can be detrimental to some I2C devices.  It is highly
recommended that the known I2C address be given to the SMBus driver in
the smb_addr parameter.  The default address range will not be used
when a smb_addr parameter is provided.

When compiled into the kernel, the addresses can be specified on the
kernel command line as:

  ipmi_smb.addr=,[,,[,...]]
	ipmi_smb.dbg=,...
	ipmi_smb.defaultprobe=0 ipmi_smb.dbg_probe=1

These are the same options as on the module command line.

Note that you might need some I2C changes if CONFIG_IPMI_PANIC_EVENT
is enabled along with this, so the I2C driver knows to run to
completion during sending a panic event.


Other Pieces
------------

Get the detailed info related with the IPMI device
--------------------------------------------------

Some users need more detailed information about a device, like where
the address came from or the raw base device for the IPMI interface.
You can use the IPMI smi_watcher to catch the IPMI interfaces as they
come or go, and to grab the information, you can use the function
ipmi_get_smi_info(), which returns the following structure:

  struct ipmi_smi_info {
	enum ipmi_addr_src addr_src;
	struct device *dev;
	union {
		struct {
			void *acpi_handle;
		} acpi_info;
	} addr_info;
  };

Currently, special info is returned only for SI_ACPI address sources.
Others may be added as necessary.

Note that the dev pointer is included in the above structure, and
assuming ipmi_get_smi_info returns success, you must call put_device
on the dev pointer.


Watchdog
--------

A watchdog timer is provided that implements the Linux-standard
watchdog timer interface.  It has several module parameters that can
be used to control it:

  modprobe ipmi_watchdog timeout= pretimeout= action=
      preaction= preop= start_now=x
      nowayout=x ifnum_to_use=n

ifnum_to_use specifies which interface the watchdog timer should use.
The default is -1, which means to pick the first one registered.

The timeout is the number of seconds to the action, and the pretimeout
is the number of seconds before the reset that the pre-timeout panic
will occur (if pretimeout is zero, then pretimeout will not be
enabled).  Note that the pretimeout is the time before the final
timeout.  So if the timeout is 50 seconds and the pretimeout is 10
seconds, then the pretimeout will occur in 40 seconds (10 seconds
before the timeout).
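
The relationship between timeout and pretimeout described above can be
written out as a short calculation.  This is a minimal sketch for
illustration only; pretimeout_fires_at() is a hypothetical helper, not
a function of the watchdog driver, and the -1 return for a disabled
pretimeout is a convention chosen for this example.

```c
#include <assert.h>

/*
 * Return how many seconds after the timer is armed the pretimeout
 * occurs, or -1 if the pretimeout is disabled (pretimeout == 0).
 * Both arguments are in seconds, as with the module parameters.
 */
static int pretimeout_fires_at(int timeout, int pretimeout)
{
	assert(timeout > 0);
	assert(pretimeout >= 0 && pretimeout < timeout);

	if (pretimeout == 0)
		return -1;

	/* The pretimeout is counted back from the final timeout. */
	return timeout - pretimeout;
}
```

For the values used in the text, pretimeout_fires_at(50, 10) gives 40.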

The action may be "reset", "power_cycle", or "power_off", and
specifies what to do when the timer times out; it defaults to
"reset".

The preaction may be "pre_smi" for an indication through the SMI
interface, "pre_int" for an indication through the SMI with an
interrupt, and "pre_nmi" for an NMI on a preaction.  This is how
the driver is informed of the pretimeout.

The preop may be set to "preop_none" for no operation on a pretimeout,
"preop_panic" to set the preoperation to panic, or "preop_give_data"
to provide data to read from the watchdog device when the pretimeout
occurs.  A "pre_nmi" setting CANNOT be used with "preop_give_data"
because you can't do data operations from an NMI.

When preop is set to "preop_give_data", one byte comes ready to read
on the device when the pretimeout occurs.  Select and fasync work on
the device, as well.

If start_now is set to 1, the watchdog timer will start running as
soon as the driver is loaded.

If nowayout is set to 1, the watchdog timer will not stop when the
watchdog device is closed.  The default value of nowayout is true
if the CONFIG_WATCHDOG_NOWAYOUT option is enabled, or false if not.

When compiled into the kernel, the kernel command line is available
for configuring the watchdog:

  ipmi_watchdog.timeout= ipmi_watchdog.pretimeout=
	ipmi_watchdog.action=
	ipmi_watchdog.preaction=
	ipmi_watchdog.preop=
	ipmi_watchdog.start_now=x
	ipmi_watchdog.nowayout=x

The options are the same as the module parameter options.

The watchdog will panic and start a 120 second reset timeout if it
gets a pre-action.  During a panic or a reboot, the watchdog will
start a 120 second timer if it is running to make sure the reboot
occurs.

Note that if you use the NMI preaction for the watchdog, you MUST NOT
use the nmi watchdog.  There is no reasonable way to tell if an NMI
comes from the IPMI controller, so it must assume that if it gets an
otherwise unhandled NMI, it must be from IPMI and it will panic
immediately.
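
The one forbidden combination of options mentioned above, "pre_nmi"
together with "preop_give_data", is easy to check before loading the
module.  The helper below is a hypothetical sketch of such a check for
a configuration script or tool; it is not taken from the driver, and it
only looks at that single rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if the preaction/preop pair is allowed by the rule in
 * the text: data cannot be provided from an NMI, so "pre_nmi" and
 * "preop_give_data" must not be used together.
 */
static bool preaction_preop_ok(const char *preaction, const char *preop)
{
	assert(preaction != NULL && preop != NULL);

	return !(strcmp(preaction, "pre_nmi") == 0 &&
		 strcmp(preop, "preop_give_data") == 0);
}
```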

Once you open the watchdog timer, you must write a 'V' character to the
device to close it, or the timer will not stop.  This is a new semantic
for the driver, but makes it consistent with the rest of the watchdog
drivers in Linux.


Panic Timeouts
--------------

The OpenIPMI driver supports the ability to put semi-custom and custom
events in the system event log if a panic occurs.  If you enable the
'Generate a panic event to all BMCs on a panic' option, you will get
one event on a panic in a standard IPMI event format.  If you enable
the 'Generate OEM events containing the panic string' option, you will
also get a bunch of OEM events holding the panic string.


The field settings of the events are:
* Generator ID: 0x21 (kernel)
* EvM Rev: 0x03 (this event is formatted in IPMI 1.0 format)
* Sensor Type: 0x20 (OS critical stop sensor)
* Sensor #: The first byte of the panic string (0 if no panic string)
* Event Dir | Event Type: 0x6f (Assertion, sensor-specific event info)
* Event Data 1: 0xa1 (Runtime stop in OEM bytes 2 and 3)
* Event data 2: second byte of panic string
* Event data 3: third byte of panic string
See the IPMI spec for the details of the event layout.  This event is
always sent to the local management controller.  It will handle routing
the message to the right place.

Other OEM events have the following format:
Record ID (bytes 0-1): Set by the SEL.
Record type (byte 2): 0xf0 (OEM non-timestamped)
byte 3: The slave address of the card saving the panic
byte 4: A sequence number (starting at zero)
The rest of the bytes (11 bytes) are the panic string.  If the panic string
is longer than 11 bytes, multiple messages will be sent with increasing
sequence numbers.

Because you cannot send OEM events using the standard interface, this
function will attempt to find an SEL and add the events there.  It
will first query the capabilities of the local management controller.
If it has an SEL, then they will be stored in the SEL of the local
management controller.
If not, and the local management controller is an event generator, the
event receiver from the local management controller will be queried
and the events sent to the SEL on that device.  Otherwise, the events
go nowhere since there is nowhere to send them.


Poweroff
--------

If the poweroff capability is selected, the IPMI driver will install
a shutdown function into the standard poweroff function pointer.  This
is in the ipmi_poweroff module.  When the system requests a powerdown,
it will send the proper IPMI commands to do this.  This is supported on
several platforms.

There is a module parameter named "poweroff_powercycle" that may
either be zero (do a power down) or non-zero (do a power cycle: power
the system off, then power it on in a few seconds).  Setting
ipmi_poweroff.poweroff_powercycle=x will do the same thing on the
kernel command line.  The parameter is also available via the proc
filesystem in /proc/sys/dev/ipmi/poweroff_powercycle.  Note that if
the system does not support power cycling, it will always do the power
off.

The "ifnum_to_use" parameter specifies which interface the poweroff
code should use.  The default is -1, which means to pick the first one
registered.

Note that if you have ACPI enabled, the system will prefer using ACPI to
power off.


What is an IRQ?

An IRQ is an interrupt request from a device.
Currently they can come in over a pin, or over a packet.
Several devices may be connected to the same pin thus
sharing an IRQ.

An IRQ number is a kernel identifier used to talk about a hardware
interrupt source.  Typically this is an index into the global irq_desc
array, but except for what linux/interrupt.h implements the details
are architecture specific.

An IRQ number is an enumeration of the possible interrupt sources on a
machine.  Typically what is enumerated is the number of input pins on
all of the interrupt controllers in the system.  In the case of ISA
what is enumerated are the 16 input pins on the two i8259 interrupt
controllers.

Architectures can assign additional meaning to the IRQ numbers, and
are encouraged to in the case where there is any manual configuration
of the hardware involved.  The ISA IRQs are a classic example of
assigning this kind of additional meaning.


ChangeLog:
	Started by Ingo Molnar
	Update by Max Krasnyansky

SMP IRQ affinity

/proc/irq/IRQ#/smp_affinity and /proc/irq/IRQ#/smp_affinity_list specify
which target CPUs are permitted for a given IRQ source.  It's a bitmask
(smp_affinity) or cpu list (smp_affinity_list) of allowed CPUs.  It's not
allowed to turn off all CPUs, and if an IRQ controller does not support
IRQ affinity then the value will not change from the default of all cpus.

/proc/irq/default_smp_affinity specifies the default affinity mask that
applies to all non-active IRQs.  Once an IRQ is allocated/activated its
affinity bitmask will be set to the default mask.  It can then be
changed as described above.  Default mask is 0xffffffff.

Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
it to CPU4-7 (this is an 8-CPU SMP box):

[root@moon 44]# cd /proc/irq/44
[root@moon 44]# cat smp_affinity
ffffffff

[root@moon 44]# echo 0f > smp_affinity
[root@moon 44]# cat smp_affinity
0000000f
[root@moon 44]# ping -f h
PING hell (195.4.7.3): 56 data bytes
...
--- hell ping statistics ---
6029 packets transmitted, 6027 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/0.4 ms
[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
           CPU0       CPU1       CPU2       CPU3      CPU4       CPU5        CPU6       CPU7
 44:       1068       1785       1785       1783         0          0           0         0    IO-APIC-level  eth1

As can be seen from the line above, IRQ44 was delivered only to the
first four processors (0-3).
Now let's restrict that IRQ to CPU(4-7).

[root@moon 44]# echo f0 > smp_affinity
[root@moon 44]# cat smp_affinity
000000f0
[root@moon 44]# ping -f h
PING hell (195.4.7.3): 56 data bytes
..
--- hell ping statistics ---
2779 packets transmitted, 2777 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.5/585.4 ms
[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
           CPU0       CPU1       CPU2       CPU3      CPU4       CPU5        CPU6       CPU7
 44:       1068       1785       1785       1783      1784       1069        1070       1069   IO-APIC-level  eth1

This time around IRQ44 was delivered only to the last four processors,
i.e. counters for CPU0-3 did not change.

Here is an example of limiting that same irq (44) to cpus 1024 to 1031:

[root@moon 44]# echo 1024-1031 > smp_affinity_list
[root@moon 44]# cat smp_affinity_list
1024-1031

Note that to do this with a bitmask would require 32 bitmasks of zero
to follow the pertinent one.


IRQ-flags state tracing

started by Ingo Molnar

The "irq-flags tracing" feature "traces" hardirq and softirq state, in
that it gives interested subsystems an opportunity to be notified of
every hardirqs-off/hardirqs-on, softirqs-off/softirqs-on event that
happens in the kernel.

CONFIG_TRACE_IRQFLAGS_SUPPORT is needed for CONFIG_PROVE_SPIN_LOCKING
and CONFIG_PROVE_RW_LOCKING to be offered by the generic lock debugging
code.  Otherwise only CONFIG_PROVE_MUTEX_LOCKING and
CONFIG_PROVE_RWSEM_LOCKING will be offered on an architecture - these
are locking APIs that are not used in IRQ context.  (The one exception
for rwsems is worked around.)

Architecture support for this is certainly not in the "trivial"
category, because lots of lowlevel assembly code deals with irq-flags
state changes.  But an architecture can be irq-flags-tracing enabled in
a rather straightforward and risk-free manner.

Architectures that want to support this need to do a couple of
code-organizational changes first:

- move their irq-flags manipulation code from their asm/system.h header
  to asm/irqflags.h

- rename local_irq_disable()/etc to raw_local_irq_disable()/etc. so that
  the linux/irqflags.h code can inject callbacks and can construct the
  real local_irq_disable()/etc APIs.

- add and enable TRACE_IRQFLAGS_SUPPORT in their arch level Kconfig file

and then a couple of functional changes are needed as well to implement
irq-flags-tracing support:

- in lowlevel entry code add (build-conditional) calls to the
  trace_hardirqs_off()/trace_hardirqs_on() functions.  The lock validator
  closely guards whether the 'real' irq-flags matches the 'virtual'
  irq-flags state, and complains loudly (and turns itself off) if the
  two do not match.  Usually most of the time for arch support for
  irq-flags-tracing is spent in this state: look at the lockdep
  complaint, try to figure out the assembly code we did not cover yet,
  fix and repeat.  Once the system has booted up and works without a
  lockdep complaint in the irq-flags-tracing functions, arch support is
  complete.
- if the architecture has non-maskable interrupts then those need to be
  excluded from the irq-tracing [and lock validation] mechanism via
  lockdep_off()/lockdep_on().

In general there is no risk from having an incomplete irq-flags-tracing
implementation in an architecture: lockdep will detect that and will
turn itself off.  I.e. the lock validator will still be reliable.  There
should be no crashes due to irq-tracing bugs.  (Except if the assembly
changes break other code by modifying conditions or registers that
shouldn't be.)


ISA Plug & Play support by Jaroslav Kysela
==========================================================

Interface /proc/isapnp
======================

The interface has been removed.  See pnp.txt for more details.

Interface /proc/bus/isapnp
==========================

This directory allows access to ISA PnP cards and logical devices.
The regular files contain the contents of ISA PnP registers for
a logical device.


               Java(tm) Binary Kernel Support for Linux v1.03
               ----------------------------------------------

Linux beats them ALL!  While all other OS's are TALKING about direct
support of Java Binaries in the OS, Linux is doing it!

You can execute Java applications and Java Applets just like any
other program after you have done the following:

1) You MUST FIRST install the Java Developers Kit for Linux.
   The Java on Linux HOWTO gives the details on getting and
   installing this.  This HOWTO can be found at:

	ftp://sunsite.unc.edu/pub/Linux/docs/HOWTO/Java-HOWTO

   You should also set up a reasonable CLASSPATH environment
   variable to use Java applications that make use of any
   nonstandard classes (not included in the same directory
   as the application itself).

2) You have to compile BINFMT_MISC either as a module or into
   the kernel (CONFIG_BINFMT_MISC) and set it up properly.
   If you choose to compile it as a module, you will have
   to insert it manually with modprobe/insmod, as kmod
   cannot easily be supported with binfmt_misc.
   Read the file 'binfmt_misc.txt' in this directory to know
   more about the configuration process.

3) Add the following configuration items to binfmt_misc
   (you should really have read binfmt_misc.txt now):
   support for Java applications:
     ':Java:M::\xca\xfe\xba\xbe::/usr/local/bin/javawrapper:'
   support for executable Jar files:
     ':ExecutableJAR:E::jar::/usr/local/bin/jarwrapper:'
   support for Java Applets:
     ':Applet:E::html::/usr/bin/appletviewer:'
   or the following, if you want to be more selective:
     ':Applet:M:: in the first line
   ('<' has to be the first character!) to let this work!

   For the compiled Java programs you need a wrapper script like the
   following (this is because Java is broken in case of the filename
   handling); again fix the path names, both in the script and in the
   above given configuration string.

   You also need the little program after the script.  Compile it like
   gcc -O2 -o javaclassname javaclassname.c
   and put it in /usr/local/bin.

   Both the javawrapper shellscript and the javaclassname program
   were supplied by Colin J. Watson .
====================== Cut here ===================
#!/bin/bash
# /usr/local/bin/javawrapper - the wrapper for binfmt_misc/java

if [ -z "$1" ]; then
	exec 1>&2
	echo Usage: $0 class-file
	exit 1
fi

CLASS=$1
FQCLASS=`/usr/local/bin/javaclassname $1`
FQCLASSN=`echo $FQCLASS | sed -e 's/^.*\.\([^.]*\)$/\1/'`
FQCLASSP=`echo $FQCLASS | sed -e 's-\.-/-g' -e 's-^[^/]*$--' -e 's-/[^/]*$--'`

# for example:
# CLASS=Test.class
# FQCLASS=foo.bar.Test
# FQCLASSN=Test
# FQCLASSP=foo/bar

unset CLASSBASE

declare -i LINKLEVEL=0

while :; do
	if [ "`basename $CLASS .class`" == "$FQCLASSN" ]; then
		# See if this directory works straight off
		cd -L `dirname $CLASS`
		CLASSDIR=$PWD
		cd $OLDPWD
		if echo $CLASSDIR | grep -q "$FQCLASSP$"; then
			CLASSBASE=`echo $CLASSDIR | sed -e "s.$FQCLASSP$.."`
			break;
		fi
		# Try dereferencing the directory name
		cd -P `dirname $CLASS`
		CLASSDIR=$PWD
		cd $OLDPWD
		if echo $CLASSDIR | grep -q "$FQCLASSP$"; then
			CLASSBASE=`echo $CLASSDIR | sed -e "s.$FQCLASSP$.."`
			break;
		fi
		# If no other possible filename exists
		if [ ! -L $CLASS ]; then
			exec 1>&2
			echo $0:
			echo "  $CLASS should be in a" \
			     "directory tree called $FQCLASSP"
			exit 1
		fi
	fi
	if [ ! -L $CLASS ]; then break; fi
	# Go down one more level of symbolic links
	let LINKLEVEL+=1
	if [ $LINKLEVEL -gt 5 ]; then
		exec 1>&2
		echo $0:
		echo "  Too many symbolic links encountered"
		exit 1
	fi
	CLASS=`ls --color=no -l $CLASS | sed -e 's/^.* \([^ ]*\)$/\1/'`
done

if [ -z "$CLASSBASE" ]; then
	if [ -z "$FQCLASSP" ]; then
		GOODNAME=$FQCLASSN.class
	else
		GOODNAME=$FQCLASSP/$FQCLASSN.class
	fi
	exec 1>&2
	echo $0:
	echo "  $FQCLASS should be in a file called $GOODNAME"
	exit 1
fi

if ! echo $CLASSPATH | grep -q "^\(.*:\)*$CLASSBASE\(:.*\)*"; then
	# class is not in CLASSPATH, so prepend dir of class to CLASSPATH
	if [ -z "${CLASSPATH}" ] ; then
		export CLASSPATH=$CLASSBASE
	else
		export CLASSPATH=$CLASSBASE:$CLASSPATH
	fi
fi

shift
/usr/bin/java $FQCLASS "$@"
====================== Cut here ===================


====================== Cut here ===================
/* javaclassname.c
 *
 * Extracts the class name from a Java class file; intended for use in a Java
 * wrapper of the type supported by the binfmt_misc option in the Linux kernel.
 *
 * Copyright (C) 1999 Colin J. Watson.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */

#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <sys/types.h>

/* From Sun's Java VM Specification, as tag entries in the constant pool. */

#define CP_UTF8 1
#define CP_INTEGER 3
#define CP_FLOAT 4
#define CP_LONG 5
#define CP_DOUBLE 6
#define CP_CLASS 7
#define CP_STRING 8
#define CP_FIELDREF 9
#define CP_METHODREF 10
#define CP_INTERFACEMETHODREF 11
#define CP_NAMEANDTYPE 12

/* Define some commonly used error messages */

#define seek_error() error("%s: Cannot seek\n", program)
#define corrupt_error() error("%s: Class file corrupt\n", program)
#define eof_error() error("%s: Unexpected end of file\n", program)
#define utf8_error() error("%s: Only ASCII 1-255 supported\n", program)

char *program;

/* File offset of each constant pool entry, indexed by pool index. */
long *pool;

u_int8_t read_8(FILE *classfile);
u_int16_t read_16(FILE *classfile);
void skip_constant(FILE *classfile, u_int16_t *cur);
void error(const char *format, ...);
int main(int argc, char **argv);

/* Reads in an unsigned 8-bit integer. */
u_int8_t read_8(FILE *classfile)
{
	int b = fgetc(classfile);
	if(b == EOF)
		eof_error();
	return (u_int8_t)b;
}

/* Reads in an unsigned 16-bit integer (big-endian). */
u_int16_t read_16(FILE *classfile)
{
	int b1, b2;
	b1 = fgetc(classfile);
	if(b1 == EOF)
		eof_error();
	b2 = fgetc(classfile);
	if(b2 == EOF)
		eof_error();
	return (u_int16_t)((b1 << 8) | b2);
}

/* Records the offset of a constant pool entry and skips over it. */
void skip_constant(FILE *classfile, u_int16_t *cur)
{
	u_int16_t len;
	int seekerr = 1;
	pool[*cur] = ftell(classfile);
	switch(read_8(classfile))
	{
	case CP_UTF8:
		len = read_16(classfile);
		seekerr = fseek(classfile, len, SEEK_CUR);
		break;
	case CP_CLASS:
	case CP_STRING:
		seekerr = fseek(classfile, 2, SEEK_CUR);
		break;
	case CP_INTEGER:
	case CP_FLOAT:
	case CP_FIELDREF:
	case CP_METHODREF:
	case CP_INTERFACEMETHODREF:
	case CP_NAMEANDTYPE:
		seekerr = fseek(classfile, 4, SEEK_CUR);
		break;
	case CP_LONG:
	case CP_DOUBLE:
		/* 8-byte constants occupy two constant pool slots */
		seekerr = fseek(classfile, 8, SEEK_CUR);
		++(*cur);
		break;
	default:
		corrupt_error();
	}
	if(seekerr)
		seek_error();
}

void error(const char *format, ...)
{
	va_list ap;
	va_start(ap, format);
	vfprintf(stderr, format, ap);
	va_end(ap);
	exit(1);
}

int main(int argc, char **argv)
{
	FILE *classfile;
	u_int16_t cp_count, i, this_class, classinfo_ptr;
	u_int16_t length;	/* the UTF8 length field is 16 bits wide */

	program = argv[0];

	if(!argv[1])
		error("%s: Missing input file\n", program);
	classfile = fopen(argv[1], "rb");
	if(!classfile)
		error("%s: Error opening %s\n", program, argv[1]);

	if(fseek(classfile, 8, SEEK_SET))  /* skip magic and version numbers */
		seek_error();
	cp_count = read_16(classfile);
	pool = calloc(cp_count, sizeof(long));
	if(!pool)
		error("%s: Out of memory for constant pool\n", program);

	for(i = 1; i < cp_count; ++i)
		skip_constant(classfile, &i);
	if(fseek(classfile, 2, SEEK_CUR))	/* skip access flags */
		seek_error();

	this_class = read_16(classfile);
	if(this_class < 1 || this_class >= cp_count)
		corrupt_error();
	if(!pool[this_class] || pool[this_class] == -1)
		corrupt_error();
	if(fseek(classfile, pool[this_class] + 1, SEEK_SET))
		seek_error();

	classinfo_ptr = read_16(classfile);
	if(classinfo_ptr < 1 || classinfo_ptr >= cp_count)
		corrupt_error();
	if(!pool[classinfo_ptr] || pool[classinfo_ptr] == -1)
		corrupt_error();
	if(fseek(classfile, pool[classinfo_ptr] + 1, SEEK_SET))
		seek_error();

	length = read_16(classfile);
	for(i = 0; i < length; ++i)
	{
		u_int8_t x = read_8(classfile);
		if((x & 0x80) || !x)
		{
			if((x & 0xE0) == 0xC0)
			{
				u_int8_t y = read_8(classfile);
				if((y & 0xC0) == 0x80)
				{
					int c = ((x & 0x1f) << 6) + (y & 0x3f);
					if(c) putchar(c);
					else utf8_error();
				}
				else utf8_error();
			}
			else utf8_error();
		}
		else if(x == '/') putchar('.');
		else putchar(x);
	}
	putchar('\n');
	free(pool);
	fclose(classfile);
	return 0;
}
====================== Cut here ===================


====================== Cut here ===================
#!/bin/bash
# /usr/local/bin/jarwrapper - the wrapper for binfmt_misc/jar

java -jar "$1"
====================== Cut here ===================


Now simply chmod +x the .class, .jar and/or .html files you want to execute.
To add a Java program to your path best put a symbolic link to the main
.class file into /usr/bin (or another place you like) omitting the .class
extension. The directory containing the original .class file will be
added to your CLASSPATH during execution.


To test your new setup, enter the following simple Java app, and name
it "HelloWorld.java":

	class HelloWorld {
		public static void main(String args[]) {
			System.out.println("Hello World!");
		}
	}

Now compile the application with:
	javac HelloWorld.java

Set the executable permissions of the binary file, with:
	chmod 755 HelloWorld.class

And then execute it:
	./HelloWorld.class


To execute Java Jar files, simply chmod the *.jar files to include
the execution bit, then just do
	./Application.jar


To execute Java Applets, simply chmod the *.html files to include
the execution bit, then just do
	./Applet.html


originally by Brian A. Lantz, brian@lantz.com
heavily edited for binfmt_misc by Richard Gunther
new scripts by Colin J. Watson
added executable Jar file support by Kurt Huwig

kernel-doc nano-HOWTO
=====================

How to format kernel-doc comments
---------------------------------

In order to provide embedded, 'C' friendly, easy to maintain,
but consistent and extractable documentation of the functions and
data structures in the Linux kernel, the Linux kernel has adopted
a consistent style for documenting functions and their parameters,
and structures and their members.

The format for this documentation is called the kernel-doc format.
It is documented in this Documentation/kernel-doc-nano-HOWTO.txt file.

This style embeds the documentation within the source files, using
a few simple conventions. The scripts/kernel-doc perl script, some
SGML templates in Documentation/DocBook, and other tools understand
these conventions, and are used to extract this embedded documentation
into various documents.
In order to provide good documentation of kernel functions and data
structures, please use the following conventions to format your
kernel-doc comments in Linux kernel source.

We definitely need kernel-doc formatted documentation for functions
that are exported to loadable modules using EXPORT_SYMBOL.

We also look to provide kernel-doc formatted documentation for
functions externally visible to other kernel files (not marked
"static").

We also recommend providing kernel-doc formatted documentation
for private (file "static") routines, for consistency of kernel
source code layout. But this is lower priority and at the
discretion of the MAINTAINER of that kernel source file.

Data structures visible in kernel include files should also be
documented using kernel-doc formatted comments.

The opening comment mark "/**" is reserved for kernel-doc comments.
Only comments so marked will be considered by the kernel-doc scripts,
and any comment so marked must be in kernel-doc format. Do not use
"/**" to begin a comment block unless the comment block contains
kernel-doc formatted comments. The closing comment marker for
kernel-doc comments can be either "*/" or "**/", but "*/" is
preferred in the Linux kernel tree.

Kernel-doc comments should be placed just before the function
or data structure being described.

Example kernel-doc function comment:

/**
 * foobar() - short function description of foobar
 * @arg1:	Describe the first argument to foobar.
 * @arg2:	Describe the second argument to foobar.
 *		One can provide multiple line descriptions
 *		for arguments.
 *
 * A longer description, with more discussion of the function foobar()
 * that might be useful to those using or modifying it. Begins with
 * empty comment line, and may include additional embedded empty
 * comment lines.
 *
 * The longer description can have multiple paragraphs.
 */

The short description following the subject can span multiple lines
and ends with an @argument description, an empty line or the end of
the comment block.

The @argument descriptions must begin on the very next line following
this opening short function description line, with no intervening
empty comment lines.

If a function parameter is "..." (varargs), it should be listed in
kernel-doc notation as:
 * @...: description


Example kernel-doc data structure comment:

/**
 * struct blah - the basic blah structure
 * @mem1:	describe the first member of struct blah
 * @mem2:	describe the second member of struct blah,
 *		perhaps with more lines and words.
 *
 * Longer description of this structure.
 */

The kernel-doc function comments describe each parameter to the
function, in order, with the @name lines.

The kernel-doc data structure comments describe each structure member
in the data structure, with the @name lines.

The longer description formatting is "reflowed", losing your line
breaks. So presenting carefully formatted lists within these
descriptions won't work so well; derived documentation will lose
the formatting.

See the section below "How to add extractable documentation to your
source files" for more details and notes on how to format kernel-doc
comments.

Components of the kernel-doc system
-----------------------------------

Many places in the source tree have extractable documentation in the
form of block comments above functions. The components of this system
are:

- scripts/kernel-doc

  This is a perl script that hunts for the block comments and can mark
  them up directly into DocBook, man, text, and HTML. (No, not
  texinfo.)

- Documentation/DocBook/*.tmpl

  These are SGML template files, which are normal SGML files with
  special place-holders for where the extracted documentation should
  go.

- scripts/basic/docproc.c

  This is a program for converting SGML template files into SGML
  files. When a file is referenced it is searched for symbols
  exported (EXPORT_SYMBOL), to be able to distinguish between internal
  and external functions.
  It invokes kernel-doc, giving it the list of functions that
  are to be documented.
  Additionally it is used to scan the SGML template files to locate
  all the files referenced therein. This is used to generate dependency
  information as used by make.

- Makefile

  The targets 'sgmldocs', 'psdocs', 'pdfdocs', and 'htmldocs' are used
  to build DocBook files, PostScript files, PDF files, and html files
  in Documentation/DocBook.

- Documentation/DocBook/Makefile

  This is where C files are associated with SGML templates.


How to extract the documentation
--------------------------------

If you just want to read the ready-made books on the various
subsystems (see Documentation/DocBook/*.tmpl), just type 'make
psdocs', or 'make pdfdocs', or 'make htmldocs', depending on your
preference. If you would rather read a different format, you can type
'make sgmldocs' and then use DocBook tools to convert
Documentation/DocBook/*.sgml to a format of your choice (for example,
'db2html ...' if 'make htmldocs' was not defined).
If you want to see man pages instead, you can do this:

$ cd linux
$ scripts/kernel-doc -man $(find -name '*.c') | split-man.pl /tmp/man
$ scripts/kernel-doc -man $(find -name '*.h') | split-man.pl /tmp/man

Here is split-man.pl:

-->
#!/usr/bin/perl

if ($#ARGV < 0) {
   die "where do I put the results?\n";
}

mkdir $ARGV[0],0777;
$state = 0;
while (<STDIN>) {
    if (/^\.TH \"[^\"]*\" 9 \"([^\"]*)\"/) {
	if ($state == 1) { close OUT }
	$state = 1;
	$fn = "$ARGV[0]/$1.9";
	print STDERR "Creating $fn\n";
	open OUT, ">$fn" or die "can't open $fn: $!\n";
	print OUT $_;
    } elsif ($state != 0) {
	print OUT $_;
    }
}

close OUT;
<--

If you just want to view the documentation for one function in one
file, you can do this:

$ scripts/kernel-doc -man -function fn file | nroff -man | less

or this:

$ scripts/kernel-doc -text -function fn file


How to add extractable documentation to your source files
---------------------------------------------------------

The format of the block comment is like this:

/**
 * function_name(:)? (- short description)?
(* @parameterx(space)*: (description of parameter x)?)*
(* a blank line)?
 * (Description:)? (Description of function)?
 * (section header: (section description)? )*
(*)?*/

All "description" text can span multiple lines, although the
function_name & its short description are traditionally on a single line.
Description text may also contain blank lines (i.e., lines that contain
only a "*").

"section header:" names must be unique per function (or struct,
union, typedef, enum).

Avoid putting a spurious blank line after the function name, or else the
description will be repeated!

All descriptive text is further processed, scanning for the following special
patterns, which are highlighted appropriately.

'funcname()' - function
'$ENVVAR' - environment variable
'&struct_name' - name of a structure (up to two words including 'struct')
'@parameter' - name of a parameter
'%CONST' - name of a constant.
NOTE 1:  The multi-line descriptive text you provide does *not* recognize
line breaks, so if you try to format some text nicely, as in:

  Return codes
    0 - cool
    1 - invalid arg
    2 - out of memory

this will all run together and produce:

  Return codes 0 - cool 1 - invalid arg 2 - out of memory

NOTE 2:  If the descriptive text you provide has lines that begin with
some phrase followed by a colon, each of those phrases will be taken as
a new section heading, which means you should similarly try to avoid text
like:

  Return codes:
    0: cool
    1: invalid arg
    2: out of memory

every line of which would start a new section.  Again, probably not
what you were after.

Take a look around the source tree for examples.


kernel-doc for structs, unions, enums, and typedefs
---------------------------------------------------

Besides functions you can also write documentation for structs, unions,
enums and typedefs. Instead of the function name you must write the name
of the declaration; the struct/union/enum/typedef must always precede
the name. Nesting of declarations is not supported.
Use the argument mechanism to document members or constants.

Inside a struct description, you can use the "private:" and "public:"
comment tags.  Structure fields that are inside a "private:" area
are not listed in the generated output documentation.  The "private:"
and "public:" tags must begin immediately following a "/*" comment
marker.  They may optionally include comments between the ":" and the
ending "*/" marker.

Example:

/**
 * struct my_struct - short description
 * @a: first member
 * @b: second member
 *
 * Longer description
 */
struct my_struct {
    int a;
    int b;
/* private: internal use only */
    int c;
};


Including documentation blocks in source files
----------------------------------------------

To facilitate having source code and comments close together, you can
include kernel-doc documentation blocks that are free-form comments
instead of being kernel-doc for functions, structures, unions,
enums, or typedefs.  This could be used for something like a
theory of operation for a driver or library code, for example.

This is done by using a DOC: section keyword with a section title.  E.g.:

/**
 * DOC: Theory of Operation
 *
 * The whizbang foobar is a dilly of a gizmo.  It can do whatever you
 * want it to do, at any time.  It reads your mind.  Here's how it works.
 *
 * foo bar splat
 *
 * The only drawback to this gizmo is that it can sometimes damage
 * hardware, software, or its subject(s).
 */

DOC: sections are used in SGML template files as indicated below.


How to make new SGML template files
-----------------------------------

SGML template files (*.tmpl) are like normal SGML files, except that
they can contain escape sequences where extracted documentation should
be inserted.

!E<filename> is replaced by the documentation, in <filename>, for
functions that are exported using EXPORT_SYMBOL: the function list is
collected from files listed in Documentation/DocBook/Makefile.

!I<filename> is replaced by the documentation for functions that are
_not_ exported using EXPORT_SYMBOL.

!D<filename> is used to name additional files to search for functions
exported using EXPORT_SYMBOL.

!F<filename> <function [functions...]> is replaced by the
documentation, in <filename>, for the functions listed.

!P<filename> <section title> is replaced by the contents of the DOC:
section titled <section title> from <filename>.
Spaces are allowed in <section title>; do not quote the <section title>.

!C<filename> is replaced by nothing, but makes the tools check that
all DOC: sections and documented functions, symbols, etc. are used.
This makes sense to use when you use !F/!P only and want to verify that
all documentation is included.

Tim.
*/

       Index of Documentation for People Interested in Writing and/or
                      Understanding the Linux Kernel.

                        Juan-Mariano de Goyeneche

/*
 * The latest version of this document may be found at:
 *   http://www.dit.upm.es/~jmseyas/linux/kernel/hackers-docs.html
 */

   The need for a document like this one became apparent in the
   linux-kernel mailing list as the same questions, asking for pointers
   to information, appeared again and again.

   Fortunately, as more and more people get to GNU/Linux, more and more
   get interested in the Kernel. But reading the sources is not always
   enough. It is easy to understand the code, but miss the concepts, the
   philosophy and design decisions behind this code.

   Unfortunately, not many documents are available for beginners to
   start. And, even if they exist, there was no "well-known" place which
   kept track of them. This index tries to fill that gap. All documents
   available online and known to the author are listed, and some
   reference books are also mentioned.

   PLEASE, if you know any paper not listed here or write a new document,
   send me an e-mail, and I'll include a reference to it here. Any
   corrections, ideas or comments are also welcome.

   The papers that follow are listed in no particular order. All are
   cataloged with the following fields: the document's "Title", the
   "Author"/s, the "URL" where they can be found, some "Keywords" helpful
   when searching for specific topics, and a brief "Description" of the
   Document.

   Enjoy!

     ON-LINE DOCS:

     * Title: "Linux Device Drivers, Third Edition"
       Author: Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman
       URL: http://lwn.net/Kernel/LDD3/
       Description: A 600-page book covering the (2.6.10) driver
       programming API and kernel hacking in general.
       Available under the Creative Commons Attribution-ShareAlike 2.0
       license.

     * Title: "The Linux Kernel"
       Author: David A. Rusling.
       URL: http://www.tldp.org/LDP/tlk/tlk.html
       Keywords: everything!, book.
       Description: Online, 200-page book describing most aspects of
       the Linux Kernel. Probably the first reference for beginners.
       Lots of illustrations explaining the use of data structures and
       their relationships in the purest W. Richard Stevens style.
       Contents: "1.-Hardware Basics, 2.-Software Basics, 3.-Memory
       Management, 4.-Processes, 5.-Interprocess Communication
       Mechanisms, 6.-PCI, 7.-Interrupts and Interrupt Handling,
       8.-Device Drivers, 9.-The File system, 10.-Networks, 11.-Kernel
       Mechanisms, 12.-Modules, 13.-The Linux Kernel Sources, A.-Linux
       Data Structures, B.-The Alpha AXP Processor, C.-Useful Web and FTP
       Sites, D.-The GNU General Public License, Glossary". In short: a
       must have.

     * Title: "Linux Device Drivers, 2nd Edition"
       Author: Alessandro Rubini and Jonathan Corbet.
       URL: http://www.xml.com/ldd/chapter/book/index.html
       Keywords: device drivers, modules, debugging, memory, hardware,
       interrupt handling, char drivers, block drivers, kmod, mmap, DMA,
       buses.
       Description: O'Reilly's popular book, now also on-line under the
       GNU Free Documentation License.
       Notes: You can also buy it in paper-form from O'Reilly. See below
       under BOOKS (Not on-line).

     * Title: "Conceptual Architecture of the Linux Kernel"
       Author: Ivan T. Bowman.
       URL: http://plg.uwaterloo.ca/
       Keywords: conceptual software architecture, extracted design,
       reverse engineering, system structure.
       Description: Conceptual software architecture of the Linux kernel,
       automatically extracted from the source code. Very detailed. Good
       figures. Gives good overall kernel understanding.

     * Title: "Concrete Architecture of the Linux Kernel"
       Author: Ivan T. Bowman, Saheem Siddiqi, and Meyer C. Tanuan.
       URL: http://plg.uwaterloo.ca/
       Keywords: concrete architecture, extracted design, reverse
       engineering, system structure, dependencies.
       Description: Concrete architecture of the Linux kernel,
       automatically extracted from the source code. Very detailed. Good
       figures. Gives good overall kernel understanding. This paper
       focuses on lower-level details than its predecessor (files,
       variables...).

     * Title: "Linux as a Case Study: Its Extracted Software
       Architecture"
       Author: Ivan T. Bowman, Richard C. Holt and Neil V. Brewster.
       URL: http://plg.uwaterloo.ca/
       Keywords: software architecture, architecture recovery,
       redocumentation.
       Description: Paper appeared at ICSE'99, Los Angeles, May 16-22,
       1999. A mixture of the previous two documents from the same
       author.

     * Title: "Overview of the Virtual File System"
       Author: Richard Gooch.
       URL: http://www.mjmwired.net/kernel/Documentation/filesystems/vfs.txt
       Keywords: VFS, File System, mounting filesystems, opening files,
       dentries, dcache.
       Description: Brief introduction to the Linux Virtual File System:
       what it is, how it works, the operations performed when opening a
       file or mounting a file system, and a description of the important
       data structures explaining the purpose of each of their entries.

     * Title: "The Linux RAID-1, 4, 5 Code"
       Author: Ingo Molnar, Gadi Oxman and Miguel de Icaza.
       URL: http://www.linuxjournal.com/article.php?sid=2391
       Keywords: RAID, MD driver.
       Description: Linux Journal Kernel Korner article. Here is its
       abstract: "A description of the implementation of the RAID-1,
       RAID-4 and RAID-5 personalities of the MD device driver in the
       Linux kernel, providing users with high performance and reliable,
       secondary-storage capability using software".

     * Title: "Dynamic Kernels: Modularized Device Drivers"
       Author: Alessandro Rubini.
       URL: http://www.linuxjournal.com/article.php?sid=1219
       Keywords: device driver, module, loading/unloading modules,
       allocating resources.
       Description: Linux Journal Kernel Korner article.
       Here is its abstract: "This is the first of a series of four
       articles co-authored by Alessandro Rubini and Georg Zezchwitz
       which present a practical approach to writing Linux device drivers
       as kernel loadable modules. This installment presents an
       introduction to the topic, preparing the reader to understand next
       month's installment".

     * Title: "Dynamic Kernels: Discovery"
       Author: Alessandro Rubini.
       URL: http://www.linuxjournal.com/article.php?sid=1220
       Keywords: character driver, init_module, cleanup_module,
       autodetection, major number, minor number, file operations,
       open(), close().
       Description: Linux Journal Kernel Korner article. Here is its
       abstract: "This article, the second of four, introduces part of
       the actual code to create custom module implementing a character
       device driver. It describes the code for module initialization and
       cleanup, as well as the open() and close() system calls".

     * Title: "The Devil's in the Details"
       Author: Georg v. Zezschwitz and Alessandro Rubini.
       URL: http://www.linuxjournal.com/article.php?sid=1221
       Keywords: read(), write(), select(), ioctl(), blocking/non
       blocking mode, interrupt handler.
       Description: Linux Journal Kernel Korner article. Here is its
       abstract: "This article, the third of four on writing character
       device drivers, introduces concepts of reading, writing, and using
       ioctl-calls".

     * Title: "Dissecting Interrupts and Browsing DMA"
       Author: Alessandro Rubini and Georg v. Zezschwitz.
       URL: http://www.linuxjournal.com/article.php?sid=1222
       Keywords: interrupts, irqs, DMA, bottom halves, task queues.
       Description: Linux Journal Kernel Korner article. Here is its
       abstract: "This is the fourth in a series of articles about
       writing character device drivers as loadable kernel modules. This
       month, we further investigate the field of interrupt handling.
       Though it is conceptually simple, practical limitations and
       constraints make this an ``interesting'' part of device driver
       writing, and several different facilities have been provided for
       different situations. We also investigate the complex topic of
       DMA".

     * Title: "Device Drivers Concluded"
       Author: Georg v. Zezschwitz.
       URL: http://www.linuxjournal.com/article.php?sid=1287
       Keywords: address spaces, pages, pagination, page management,
       demand loading, swapping, memory protection, memory mapping, mmap,
       virtual memory areas (VMAs), vremap, PCI.
       Description: In the end, the series above grew to five articles.
       The introduction to this last one reads: "This is the last of five
       articles about character device drivers. In this final section,
       Georg deals with memory mapping devices, beginning with an overall
       description of the Linux memory management concepts".

     * Title: "Network Buffers And Memory Management"
       Author: Alan Cox.
       URL: http://www.linuxjournal.com/article.php?sid=1312
       Keywords: sk_buffs, network devices, protocol/link layer
       variables, network devices flags, transmit, receive,
       configuration, multicast.
       Description: Linux Journal Kernel Korner. Here is the abstract:
       "Writing a network device driver for Linux is fundamentally
       simple---most of the complexity (other than talking to the
       hardware) involves managing network packets in memory".

     * Title: "Writing Linux Device Drivers"
       Author: Michael K. Johnson.
       URL: http://users.evitech.fi/~tk/rtos/writing_linux_device_d.html
       Keywords: files, VFS, file operations, kernel interface, character
       vs block devices, I/O access, hardware interrupts, DMA, access to
       user memory, memory allocation, timers.
       Description: Introductory 50-minutes (sic) tutorial on writing
       device drivers. 12 pages, written by the author of the "Kernel
       Hackers' Guide", which give a very good overview of the topic.

     * Title: "The Venus kernel interface"
       Author: Peter J. Braam.
       URL: http://www.coda.cs.cmu.edu/doc/html/kernel-venus-protocol.html
       Keywords: coda, filesystem, venus, cache manager.
       Description: "This document describes the communication between
       Venus and kernel level file system code needed for the operation
       of the Coda filesystem. This version document is meant to describe
       the current interface (version 1.0) as well as improvements we
       envisage".

     * Title: "Programming PCI-Devices under Linux"
       Author: Claus Schroeter.
       URL: ftp://ftp.llp.fu-berlin.de/pub/linux/LINUX-LAB/whitepapers/pcip.ps.gz
       Keywords: PCI, device, busmastering.
       Description: 6-page tutorial on PCI programming under Linux.
       Gives the basic concepts of the architecture of the PCI subsystem,
       as well as the basic functions and macros to read/write the
       devices and perform busmastering.

     * Title: "Writing Character Device Driver for Linux"
       Author: R. Baruch and C. Schroeter.
       URL: ftp://ftp.llp.fu-berlin.de/pub/linux/LINUX-LAB/whitepapers/drivers.ps.gz
       Keywords: character device drivers, I/O, signals, DMA, accessing
       ports in user space, kernel environment.
       Description: 68-page paper on writing character drivers. A little
       bit old (1993, 1994) although still useful.

     * Title: "Design and Implementation of the Second Extended
       Filesystem"
       Author: Remy Card, Theodore Ts'o, Stephen Tweedie.
       URL: http://web.mit.edu/tytso/www/linux/ext2intro.html
       Keywords: ext2, linux fs history, inode, directory, link, devices,
       VFS, physical structure, performance, benchmarks, ext2fs library,
       ext2fs tools, e2fsck.
       Description: Paper written by three of the top ext2 hackers.
       Covers Linux filesystems history, ext2 motivation, ext2 features,
       design, physical structure on disk, performance, benchmarks,
       e2fsck's passes description... A must read!
       Notes: This paper was first published in the Proceedings of the
       First Dutch International Symposium on Linux, ISBN 90-367-0385-9.

     * Title: "Analysis of the Ext2fs structure"
       Author: Louis-Dominique Dubeau.
       URL: http://www.nondot.org/sabre/os/files/FileSystems/ext2fs/
       Keywords: ext2, filesystem, ext2fs.
       Description: Description of ext2's blocks, directories, inodes,
       bitmaps, invariants...

     * Title: "Journaling the Linux ext2fs Filesystem"
       Author: Stephen C. Tweedie.
       URL: ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/journal-design.ps.gz
       Keywords: ext3, journaling.
       Description: Excellent 8-page paper explaining the journaling
       capabilities added to ext2 by the author, showing different
       problems faced and the alternatives chosen.

     * Title: "Kernel API changes from 2.0 to 2.2"
       Author: Richard Gooch.
       URL: http://www.linuxhq.com/guides/LKMPG/node28.html
       Keywords: 2.2, changes.
       Description: Kernel functions/structures/variables which changed
       from 2.0.x to 2.2.x.

     * Title: "Kernel API changes from 2.2 to 2.4"
       Author: Richard Gooch.
       Keywords: 2.4, changes.
       Description: Kernel functions/structures/variables which changed
       from 2.2.x to 2.4.x.

     * Title: "Linux Kernel Module Programming Guide"
       Author: Ori Pomerantz.
       URL: http://tldp.org/LDP/lkmpg/2.6/html/index.html
       Keywords: modules, GPL book, /proc, ioctls, system calls,
       interrupt handlers.
       Description: Very nice 92-page GPL book on the topic of module
       programming. Lots of examples.

     * Title: "I/O Event Handling Under Linux"
       Author: Richard Gooch.
       Keywords: IO, I/O, select(2), poll(2), FDs, aio_read(2), readiness
       event queues.
       Description: From the Introduction: "I/O Event handling is about
       how your Operating System allows you to manage a large number of
       open files (file descriptors in UNIX/POSIX, or FDs) in your
       application. You want the OS to notify you when FDs become active
       (have data ready to be read or are ready for writing). Ideally you
       want a mechanism that is scalable. This means a large number of
       inactive FDs cost very little in memory and CPU time to manage".

     * Title: "The Kernel Hacking HOWTO"
       Author: Various Talented People, and Rusty.
  Location: in kernel tree, Documentation/DocBook/kernel-hacking.tmpl
  (must be built as "make {htmldocs | psdocs | pdfdocs}")
  Keywords: HOWTO, kernel contexts, deadlock, locking, modules,
  symbols, return conventions.
  Description: From the Introduction: "Please understand that I
  never wanted to write this document, being grossly underqualified,
  but I always wanted to read it, and this was the only way. I
  simply explain some best practices, and give reading entry-points
  into the kernel sources. I avoid implementation details: that's
  what the code is for, and I ignore whole tracts of useful
  routines. This document assumes familiarity with C, and an
  understanding of what the kernel is, and how it is used. It was
  originally written for the 2.3 kernels, but nearly all of it
  applies to 2.2 too; 2.0 is slightly different".

* Title: "Writing an ALSA Driver"
  Author: Takashi Iwai
  URL: http://www.alsa-project.org/~iwai/writing-an-alsa-driver/index.html
  Keywords: ALSA, sound, soundcard, driver, lowlevel, hardware.
  Description: Advanced Linux Sound Architecture for developers,
  both at kernel and user-level sides. ALSA is the Linux kernel
  sound architecture in the 2.6 kernel version.

* Title: "Programming Guide for Linux USB Device Drivers"
  Author: Detlef Fliegl.
  URL: http://usb.in.tum.de/usbdoc/
  Keywords: USB, universal serial bus.
  Description: A must-read. From the Preface: "This document should
  give detailed information about the current state of the USB
  subsystem and its API for USB device drivers. The first section
  will deal with the basics of USB devices. You will learn about
  different types of devices and their properties. Going into detail
  you will see how USB devices communicate on the bus. The second
  section gives an overview of the Linux USB subsystem [2] and the
  device driver framework. Then the API and its data structures will
  be explained step by step. The last section of this document
  contains a reference of all API calls and their return codes".
  Notes: Beware: the main page states: "This document may not be
  published, printed or used in excerpts without explicit permission
  of the author". Fortunately, it may still be read...

* Title: "Linux Kernel Mailing List Glossary"
  Author: various
  URL: http://kernelnewbies.org/glossary/
  Keywords: glossary, terms, linux-kernel.
  Description: From the introduction: "This glossary is intended as
  a brief description of some of the acronyms and terms you may hear
  during discussion of the Linux kernel".

* Title: "Linux Kernel Locking HOWTO"
  Author: Various Talented People, and Rusty.
  Location: in kernel tree, Documentation/DocBook/kernel-locking.tmpl
  (must be built as "make {htmldocs | psdocs | pdfdocs}")
  Keywords: locks, locking, spinlock, semaphore, atomic, race
  condition, bottom halves, tasklets, softirqs.
  Description: The title says it all: document describing the
  locking system in the Linux Kernel either in uniprocessor or SMP
  systems.
  Notes: "It was originally written for the later (>2.3.47) 2.3
  kernels, but most of it applies to 2.2 too; 2.0 is slightly
  different". Freely redistributable under the conditions of the GNU
  General Public License.

* Title: "Global spinlock list and usage"
  Author: Rick Lindsley.
  URL: http://lse.sourceforge.net/lockhier/global-spin-lock
  Keywords: spinlock.
  Description: This is an attempt to document both the existence and
  usage of the spinlocks in the Linux 2.4.5 kernel. Comprehensive
  list of spinlocks showing when they are used, which functions
  access them, how each lock is acquired, under what conditions it
  is held, whether interrupts can occur or not while it is held...

* Title: "Porting Linux 2.0 Drivers To Linux 2.2: Changes and New Features"
  Author: Alan Cox.
  URL: http://www.linux-mag.com/1999-05/gear_01.html
  Keywords: ports, porting.
  Description: Article from Linux Magazine on porting from 2.0 to
  2.2 kernels.

* Title: "Porting Device Drivers To Linux 2.2: part II"
  Author: Alan Cox.
URL: http://www.linux-mag.com/id/238 Keywords: ports, porting. Description: Second part on porting from 2.0 to 2.2 kernels. * Title: "How To Make Sure Your Driver Will Work On The Power Macintosh" Author: Paul Mackerras. URL: http://www.linux-mag.com/id/261 Keywords: Mac, Power Macintosh, porting, drivers, compatibility. Description: The title says it all. * Title: "An Introduction to SCSI Drivers" Author: Alan Cox. URL: http://www.linux-mag.com/id/284 Keywords: SCSI, device, driver. Description: The title says it all. * Title: "Advanced SCSI Drivers And Other Tales" Author: Alan Cox. URL: http://www.linux-mag.com/id/307 Keywords: SCSI, device, driver, advanced. Description: The title says it all. * Title: "Writing Linux Mouse Drivers" Author: Alan Cox. URL: http://www.linux-mag.com/id/330 Keywords: mouse, driver, gpm. Description: The title says it all. * Title: "More on Mouse Drivers" Author: Alan Cox. URL: http://www.linux-mag.com/id/356 Keywords: mouse, driver, gpm, races, asynchronous I/O. Description: The title still says it all. * Title: "Writing Video4linux Radio Driver" Author: Alan Cox. URL: http://www.linux-mag.com/id/381 Keywords: video4linux, driver, radio, radio devices. Description: The title says it all. * Title: "Video4linux Drivers, Part 1: Video-Capture Device" Author: Alan Cox. URL: http://www.linux-mag.com/id/406 Keywords: video4linux, driver, video capture, capture devices, camera driver. Description: The title says it all. * Title: "Video4linux Drivers, Part 2: Video-capture Devices" Author: Alan Cox. URL: http://www.linux-mag.com/id/429 Keywords: video4linux, driver, video capture, capture devices, camera driver, control, query capabilities, capability, facility. Description: The title says it all. * Title: "PCI Management in Linux 2.2" Author: Alan Cox. URL: http://www.linux-mag.com/id/452 Keywords: PCI, bus, bus-mastering. Description: The title says it all. 
* Title: "Linux 2.4 Kernel Internals"
  Author: Tigran Aivazian and Christoph Hellwig.
  URL: http://www.moses.uklinux.net/patches/lki.html
  Keywords: Linux, kernel, booting, SMP boot, VFS, page cache.
  Description: A little book used for a short training course.
  Covers building the kernel image, booting (including SMP bootup),
  process management, VFS and more.

* Title: "Linux IP Networking. A Guide to the Implementation and
  Modification of the Linux Protocol Stack."
  Author: Glenn Herrin.
  URL: http://www.cs.unh.edu/cnrg/gherrin
  Keywords: network, networking, protocol, IP, UDP, TCP, connection,
  socket, receiving, transmitting, forwarding, routing, packets,
  modules, /proc, sk_buff, FIB, tags.
  Description: Excellent paper devoted to Linux IP networking,
  explaining everything from the kernel code to the code of the user
  space configuration tools. Very good for getting a general
  overview of the kernel networking implementation and understanding
  all the steps packets follow from the time they are received at
  the network device until they are delivered to applications. The
  kernel code studied is from version 2.2.14. Provides code for a
  working packet dropper example.

* Title: "Get those boards talking under Linux."
  Author: Alex Ivchenko.
  URL: http://www.edn.com/article/CA46968.html
  Keywords: data-acquisition boards, drivers, modules, interrupts,
  memory allocation.
  Description: Article written for people wishing to make their data
  acquisition boards work on their GNU/Linux machines. Gives a basic
  overview of writing drivers, from the naming of functions to
  interrupt handling.
  Notes: Two-part article. Part II is at URL:
  http://www.edn.com/article/CA46998.html

* Title: "Linux PCMCIA Programmer's Guide"
  Author: David Hinds.
  URL: http://pcmcia-cs.sourceforge.net/ftp/doc/PCMCIA-PROG.html
  Keywords: PCMCIA.
  Description: "This document describes how to write kernel device
  drivers for the Linux PCMCIA Card Services interface.
It also describes how to write user-mode utilities for communicating with Card Services. * Title: "The Linux Kernel NFSD Implementation" Author: Neil Brown. URL: http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/nfsd.html Keywords: knfsd, nfsd, NFS, RPC, lockd, mountd, statd. Description: The title says it all. Notes: Covers knfsd's version 1.4.7 (patch against 2.2.7 kernel). * Title: "A Linux vm README" Author: Kanoj Sarcar. URL: http://kos.enix.org/pub/linux-vmm.html Keywords: virtual memory, mm, pgd, vma, page, page flags, page cache, swap cache, kswapd. Description: Telegraphic, short descriptions and definitions relating the Linux virtual memory implementation. * Title: "(nearly) Complete Linux Loadable Kernel Modules. The definitive guide for hackers, virus coders and system administrators." Author: pragmatic/THC. URL: http://packetstormsecurity.org/docs/hack/LKM_HACKING.html Keywords: syscalls, intercept, hide, abuse, symbol table. Description: Interesting paper on how to abuse the Linux kernel in order to intercept and modify syscalls, make files/directories/processes invisible, become root, hijack ttys, write kernel modules based virus... and solutions for admins to avoid all those abuses. Notes: For 2.0.x kernels. Gives guidances to port it to 2.2.x kernels. BOOKS: (Not on-line) * Title: "Linux Device Drivers" Author: Alessandro Rubini. Publisher: O'Reilly & Associates. Date: 1998. Pages: 439. ISBN: 1-56592-292-1 * Title: "Linux Device Drivers, 2nd Edition" Author: Alessandro Rubini and Jonathan Corbet. Publisher: O'Reilly & Associates. Date: 2001. Pages: 586. ISBN: 0-59600-008-1 Notes: Further information in http://www.oreilly.com/catalog/linuxdrive2/ * Title: "Linux Device Drivers, 3rd Edition" Authors: Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman Publisher: O'Reilly & Associates. Date: 2005. Pages: 636. 
  ISBN: 0-596-00590-3
  Notes: Further information in
  http://www.oreilly.com/catalog/linuxdrive3/
  PDF format, URL: http://lwn.net/Kernel/LDD3/

* Title: "Linux Kernel Internals"
  Author: Michael Beck.
  Publisher: Addison-Wesley.
  Date: 1997.
  ISBN: 0-201-33143-8 (second edition)

* Title: "The Design of the UNIX Operating System"
  Author: Maurice J. Bach.
  Publisher: Prentice Hall.
  Date: 1986.
  Pages: 471.
  ISBN: 0-13-201757-1

* Title: "The Design and Implementation of the 4.3 BSD UNIX
  Operating System"
  Author: Samuel J. Leffler, Marshall Kirk McKusick, Michael J.
  Karels, John S. Quarterman.
  Publisher: Addison-Wesley.
  Date: 1989 (reprinted with corrections on October, 1990).
  ISBN: 0-201-06196-1

* Title: "The Design and Implementation of the 4.4 BSD UNIX
  Operating System"
  Author: Marshall Kirk McKusick, Keith Bostic, Michael J. Karels,
  John S. Quarterman.
  Publisher: Addison-Wesley.
  Date: 1996.
  ISBN: 0-201-54979-4

* Title: "Programmation Linux 2.0 API systeme et fonctionnement du
  noyau"
  Author: Remy Card, Eric Dumas, Franck Mevel.
  Publisher: Eyrolles.
  Date: 1997.
  Pages: 520.
  ISBN: 2-212-08932-5
  Notes: French.

* Title: "Unix internals -- the new frontiers"
  Author: Uresh Vahalia.
  Publisher: Prentice Hall.
  Date: 1996.
  Pages: 600.
  ISBN: 0-13-101908-2

* Title: "Programming for the real world - POSIX.4"
  Author: Bill O. Gallmeister.
  Publisher: O'Reilly & Associates, Inc.
  Date: 1995.
  Pages: ???.
  ISBN: 1-56592-074-0
  Notes: Though not directly about Linux, it is a good reference,
  since Linux aims to be POSIX.

* Title: "UNIX Systems for Modern Architectures: Symmetric
  Multiprocessing and Caching for Kernel Programmers"
  Author: Curt Schimmel.
  Publisher: Addison Wesley.
  Date: June, 1994.
  Pages: 432.
  ISBN: 0-201-63338-8

MISCELLANEOUS:

* Name: linux/Documentation
  Author: Many.
  URL: Just look inside your kernel sources.
  Keywords: anything, DocBook.
  Description: Documentation that comes with the kernel sources,
  inside the Documentation directory.
Some pages from this document (including this document itself) have been moved there, and might be more up to date than the web version. * Name: "Linux Kernel Source Reference" Author: Thomas Graichen. URL: http://marc.info/?l=linux-kernel&m=96446640102205&w=4 Keywords: CVS, web, cvsweb, browsing source code. Description: Web interface to a CVS server with the kernel sources. "Here you can have a look at any file of the Linux kernel sources of any version starting from 1.0 up to the (daily updated) current version available. Also you can check the differences between two versions of a file". * Name: "Cross-Referencing Linux" URL: http://lxr.linux.no/source/ Keywords: Browsing source code. Description: Another web-based Linux kernel source code browser. Lots of cross references to variables and functions. You can see where they are defined and where they are used. * Name: "Linux Weekly News" URL: http://lwn.net Keywords: latest kernel news. Description: The title says it all. There's a fixed kernel section summarizing developers' work, bug fixes, new features and versions produced during the week. Published every Thursday. * Name: "Kernel Traffic" URL: http://kt.earth.li/kernel-traffic/index.html Keywords: linux-kernel mailing list, weekly kernel news. Description: Weekly newsletter covering the most relevant discussions of the linux-kernel mailing list. * Name: "CuTTiNG.eDGe.LiNuX" URL: http://edge.kernelnotes.org Keywords: changelist. Description: Site which provides the changelist for every kernel release. What's new, what's better, what's changed. Myrdraal reads the patches and describes them. Pointers to the patches are there, too. * Name: "New linux-kernel Mailing List FAQ" URL: http://www.tux.org/lkml/ Keywords: linux-kernel mailing list FAQ. Description: linux-kernel is a mailing list for developers to communicate. This FAQ builds on the previous linux-kernel mailing list FAQ maintained by Frohwalt Egerer, who no longer maintains it. 
Read it to see how to join the mailing list. Dozens of interesting questions regarding the list, Linux, developers (who is ...?), terms (what is...?) are answered here too. Just read it. * Name: "Linux Virtual File System" Author: Peter J. Braam. URL: http://www.coda.cs.cmu.edu/doc/talks/linuxvfs/ Keywords: slides, VFS, inode, superblock, dentry, dcache. Description: Set of slides, presumably from a presentation on the Linux VFS layer. Covers version 2.1.x, with dentries and the dcache. * Name: "Gary's Encyclopedia - The Linux Kernel" Author: Gary (I suppose...). URL: http://slencyclopedia.berlios.de/index.html Keywords: linux, community, everything! Description: Gary's Encyclopedia exists to allow the rapid finding of documentation and other information of interest to GNU/Linux users. It has about 4000 links to external pages in 150 major categories. This link is for kernel-specific links, documents, sites... This list is now hosted by developer.Berlios.de, but seems not to have been updated since sometime in 1999. * Name: "The home page of Linux-MM" Author: The Linux-MM team. URL: http://linux-mm.org/ Keywords: memory management, Linux-MM, mm patches, TODO, docs, mailing list. Description: Site devoted to Linux Memory Management development. Memory related patches, HOWTOs, links, mm developers... Don't miss it if you are interested in memory management development! * Name: "Kernel Newbies IRC Channel" URL: http://www.kernelnewbies.org Keywords: IRC, newbies, channel, asking doubts. Description: #kernelnewbies on irc.openprojects.net. From the web page: "#kernelnewbies is an IRC network dedicated to the 'newbie' kernel hacker. The audience mostly consists of people who are learning about the kernel, working on kernel projects or professional kernel hackers that want to help less seasoned kernel people. [...] #kernelnewbies is on the Open Projects IRC Network, try irc.openprojects.net or irc..openprojects.net as your server and then /join #kernelnewbies". 
  It also hosts articles, documents, FAQs...

* Name: "linux-kernel mailing list archives and search engines"
  URL: http://vger.kernel.org/vger-lists.html
  URL: http://www.uwsg.indiana.edu/hypermail/linux/kernel/index.html
  URL: http://marc.theaimsgroup.com/?l=linux-kernel
  URL: http://groups.google.com/group/mlist.linux.kernel
  URL: http://www.cs.helsinki.fi/linux/linux-kernel/
  URL: http://www.lib.uaa.alaska.edu/linux-kernel/
  Keywords: linux-kernel, archives, search.
  Description: Some of the linux-kernel mailing list archivers. If
  you have a better/another one, please let me know.
_________________________________________________________________

Document last updated on Sat 2005-NOV-19

Kernel Parameters
~~~~~~~~~~~~~~~~~

The following is a consolidated list of the kernel parameters as
implemented (mostly) by the __setup() macro and sorted into English
Dictionary order (defined as ignoring all punctuation and sorting digits
before letters in a case insensitive manner), and with descriptions
where known.

Module parameters for loadable modules are specified only as the
parameter name with optional '=' and value as appropriate, such as:

        modprobe usbcore blinkenlights=1

Module parameters for modules that are built into the kernel image
are specified on the kernel command line with the module name plus
'.' plus parameter name, with '=' and value if appropriate, such as:

        usbcore.blinkenlights=1

Hyphens (dashes) and underscores are equivalent in parameter names, so
        log_buf_len=1M print-fatal-signals=1
can also be entered as
        log-buf-len=1M print_fatal_signals=1

This document may not be entirely up to date and comprehensive. The
command "modinfo -p ${modulename}" shows a current list of all
parameters of a loadable module. Loadable modules, after being loaded
into the running kernel, also reveal their parameters in
/sys/module/${modulename}/parameters/.
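
As a rough illustration of the naming rules above (the module-name
prefix for built-in modules, and the equivalence of hyphens and
underscores in parameter names), the following Python sketch splits a
single command-line token into its parts. The helper name split_param
is hypothetical and is not part of the kernel or of any tool; the
kernel does its own parsing in C.

```python
def split_param(token):
    """Split one command-line token into (module, name, value).

    Hypothetical helper for illustration only; it mirrors the rules
    described in the text, not the kernel's actual implementation.
    """
    key, sep, value = token.partition("=")
    # Hyphens and underscores are equivalent in the parameter *name*;
    # the value is left untouched.
    key = key.replace("-", "_")
    module, dot, name = key.partition(".")
    if not dot:
        # No "module." prefix: a plain kernel parameter.
        module, name = None, key
    return module, name, (value if sep else None)
```

With this sketch, "log-buf-len=1M" and "log_buf_len=1M" yield the same
name, and "usbcore.blinkenlights=1" is split into the module usbcore,
the parameter blinkenlights and the value "1".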
Some of these parameters may be changed at runtime by the command
"echo -n ${value} > /sys/module/${modulename}/parameters/${parm}".

The parameters listed below are only valid if certain kernel build
options were enabled and if respective hardware is present. The text in
square brackets at the beginning of each description states the
restrictions within which a parameter is applicable:

        ACPI            ACPI support is enabled.
        AGP             AGP (Accelerated Graphics Port) is enabled.
        ALSA            ALSA sound support is enabled.
        APIC            APIC support is enabled.
        APM             Advanced Power Management support is enabled.
        ARM             ARM architecture is enabled.
        AVR32           AVR32 architecture is enabled.
        AX25            Appropriate AX.25 support is enabled.
        BLACKFIN        Blackfin architecture is enabled.
        DRM             Direct Rendering Management support is enabled.
        DYNAMIC_DEBUG   Build in debug messages and enable them at runtime
        EDD             BIOS Enhanced Disk Drive Services (EDD) is enabled
        EFI             EFI Partitioning (GPT) is enabled
        EIDE            EIDE/ATAPI support is enabled.
        EVM             Extended Verification Module
        FB              The frame buffer device is enabled.
        FTRACE          Function tracing enabled.
        GCOV            GCOV profiling is enabled.
        HW              Appropriate hardware is enabled.
        IA-64           IA-64 architecture is enabled.
        IMA             Integrity measurement architecture is enabled.
        IOSCHED         More than one I/O scheduler is enabled.
        IP_PNP          IP DHCP, BOOTP, or RARP is enabled.
        IPV6            IPv6 support is enabled.
        ISAPNP          ISA PnP code is enabled.
        ISDN            Appropriate ISDN support is enabled.
        JOY             Appropriate joystick support is enabled.
        KGDB            Kernel debugger support is enabled.
        KVM             Kernel Virtual Machine support is enabled.
        LIBATA          Libata driver is enabled
        LP              Printer support is enabled.
        LOOP            Loopback device support is enabled.
        M68k            M68k architecture is enabled.
                        These options have more detailed description inside of
                        Documentation/m68k/kernel-options.txt.
        MCA             MCA bus support is enabled.
        MDA             MDA console support is enabled.
        MIPS            MIPS architecture is enabled.
        MOUSE           Appropriate mouse support is enabled.
        MSI             Message Signaled Interrupts (PCI).
        MTD             MTD (Memory Technology Device) support is enabled.
        NET             Appropriate network support is enabled.
        NUMA            NUMA support is enabled.
        NFS             Appropriate NFS support is enabled.
        OSS             OSS sound support is enabled.
        PV_OPS          A paravirtualized kernel is enabled.
        PARIDE          The ParIDE (parallel port IDE) subsystem is enabled.
        PARISC          The PA-RISC architecture is enabled.
        PCI             PCI bus support is enabled.
        PCIE            PCI Express support is enabled.
        PCMCIA          The PCMCIA subsystem is enabled.
        PNP             Plug & Play support is enabled.
        PPC             PowerPC architecture is enabled.
        PPT             Parallel port support is enabled.
        PS2             Appropriate PS/2 support is enabled.
        RAM             RAM disk support is enabled.
        S390            S390 architecture is enabled.
        SCSI            Appropriate SCSI support is enabled.
                        A lot of drivers have their options described inside
                        the Documentation/scsi/ sub-directory.
        SECURITY        Different security models are enabled.
        SELINUX         SELinux support is enabled.
        APPARMOR        AppArmor support is enabled.
        SERIAL          Serial support is enabled.
        SH              SuperH architecture is enabled.
        SMP             The kernel is an SMP kernel.
        SPARC           Sparc architecture is enabled.
        SWSUSP          Software suspend (hibernation) is enabled.
        SUSPEND         System suspend states are enabled.
        TPM             TPM drivers are enabled.
        TS              Appropriate touchscreen support is enabled.
        UMS             USB Mass Storage support is enabled.
        USB             USB support is enabled.
        USBHID          USB Human Interface Device support is enabled.
        V4L             Video For Linux support is enabled.
        VGA             The VGA console has been enabled.
        VT              Virtual terminal support is enabled.
        WDT             Watchdog support is enabled.
        XT              IBM PC/XT MFM hard disk support is enabled.
        X86-32          X86-32, aka i386 architecture is enabled.
        X86-64          X86-64 architecture is enabled.
                        More X86-64 boot options can be found in
                        Documentation/x86/x86_64/boot-options.txt.
        X86             Either 32-bit or 64-bit x86 (same as X86-32+X86-64)
        XEN             Xen support is enabled

In addition, the following text indicates that the option:

        BUGS=           Relates to possible processor bugs on the said processor.
        KNL             Is a kernel start-up parameter.
        BOOT            Is a boot loader parameter.

Parameters denoted with BOOT are actually interpreted by the boot
loader, and have no meaning to the kernel directly.
Do not modify the syntax of boot loader parameters without extreme
need or coordination with .

There are also arch-specific kernel-parameters not documented here.
See for example .

Note that ALL kernel parameters listed below are CASE SENSITIVE, and that
a trailing = on the name of any parameter states that that parameter will
be entered as an environment variable, whereas its absence indicates that
it will appear as a kernel argument readable via /proc/cmdline by programs
running once the system is up.

The number of kernel parameters is not limited, but the length of the
complete command line (parameters including spaces etc.) is limited to
a fixed number of characters. This limit depends on the architecture
and is between 256 and 4096 characters. It is defined in the file
./include/asm/setup.h as COMMAND_LINE_SIZE.

Finally, the [KMG] suffix is commonly described after a number of kernel
parameter values. These 'K', 'M', and 'G' letters represent the _binary_
multipliers 'Kilo', 'Mega', and 'Giga', equalling 2^10, 2^20, and 2^30
bytes respectively. Such letter suffixes can also be entirely omitted.

        acpi=           [HW,ACPI,X86]
                        Advanced Configuration and Power Interface
                        Format: { force | off | strict | noirq | rsdt |
                                  copy_dsdt }
                        force -- enable ACPI if default was off
                        off -- disable ACPI if default was on
                        noirq -- do not use ACPI for IRQ routing
                        strict -- Be less tolerant of platforms that are not
                                strictly ACPI specification compliant.
                        rsdt -- prefer RSDT over (default) XSDT
                        copy_dsdt -- copy DSDT to memory

                        See also Documentation/power/runtime_pm.txt, pci=noacpi

        acpi_rsdp=      [ACPI,EFI,KEXEC]
                        Pass the RSDP address to the kernel, mostly used
                        on machines running EFI runtime service to boot the
                        second kernel for kdump.
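
The [KMG] suffix convention described in the introductory notes above
can be sketched in a few lines of Python. This is only an illustration
of the arithmetic (binary multipliers of 2^10, 2^20 and 2^30); the
function name parse_size is hypothetical, and accepting lower-case
suffixes and only decimal numbers are assumptions of this sketch, not
statements about the kernel's own parser.

```python
# Binary multipliers as described in the text: K = 2^10, M = 2^20, G = 2^30.
_MULTIPLIERS = {"K": 2**10, "M": 2**20, "G": 2**30}


def parse_size(text):
    """Convert a value such as "1M" or "4096" into a number of bytes.

    Hypothetical helper: decimal digits only, optional K/M/G suffix.
    Lower-case suffixes are accepted here as an assumption.
    """
    suffix = text[-1:].upper()
    if suffix in _MULTIPLIERS:
        return int(text[:-1]) * _MULTIPLIERS[suffix]
    # The suffix may be omitted entirely, in which case the value is bytes.
    return int(text)
```

For example, the value in log_buf_len=1M corresponds to 1048576 bytes
under this reading.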
acpi_apic_instance= [ACPI, IOAPIC] Format: 2: use 2nd APIC table, if available 1,0: use 1st APIC table default: 0 acpi_backlight= [HW,ACPI] acpi_backlight=vendor acpi_backlight=video If set to vendor, prefer vendor specific driver (e.g. thinkpad_acpi, sony_acpi, etc.) instead of the ACPI video.ko driver. acpi.debug_layer= [HW,ACPI,ACPI_DEBUG] acpi.debug_level= [HW,ACPI,ACPI_DEBUG] Format: CONFIG_ACPI_DEBUG must be enabled to produce any ACPI debug output. Bits in debug_layer correspond to a _COMPONENT in an ACPI source file, e.g., #define _COMPONENT ACPI_PCI_COMPONENT Bits in debug_level correspond to a level in ACPI_DEBUG_PRINT statements, e.g., ACPI_DEBUG_PRINT((ACPI_DB_INFO, ... The debug_level mask defaults to "info". See Documentation/acpi/debug.txt for more information about debug layers and levels. Enable processor driver info messages: acpi.debug_layer=0x20000000 Enable PCI/PCI interrupt routing info messages: acpi.debug_layer=0x400000 Enable AML "Debug" output, i.e., stores to the Debug object while interpreting AML: acpi.debug_layer=0xffffffff acpi.debug_level=0x2 Enable all messages related to ACPI hardware: acpi.debug_layer=0x2 acpi.debug_level=0xffffffff Some values produce so much output that the system is unusable. The "log_buf_len" parameter may be useful if you need to capture more output. acpi_irq_balance [HW,ACPI] ACPI will balance active IRQs default in APIC mode acpi_irq_nobalance [HW,ACPI] ACPI will not move active IRQs (default) default in PIC mode acpi_irq_isa= [HW,ACPI] If irq_balance, mark listed IRQs used by ISA Format: ,... acpi_irq_pci= [HW,ACPI] If irq_balance, clear listed IRQs for use by PCI Format: ,... 
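
The acpi.debug_layer and acpi.debug_level values shown earlier are bit
masks, so several components can be selected at once by OR-ing their
bits together. The Python sketch below builds such an argument from the
two layer values quoted in the acpi.debug_layer examples; the constant
and function names are hypothetical and chosen for this illustration
only.

```python
# Layer bits taken from the examples in the acpi.debug_layer entry.
# The names are hypothetical; only the numeric values come from the text.
ACPI_LAYER_PROCESSOR = 0x20000000  # processor driver messages
ACPI_LAYER_PCI = 0x00400000        # PCI/PCI interrupt routing messages


def debug_layer_arg(*masks):
    """Return an acpi.debug_layer= argument with all given bits set."""
    combined = 0
    for mask in masks:
        combined |= mask
    return "acpi.debug_layer=0x%08x" % combined
```

Calling debug_layer_arg(ACPI_LAYER_PROCESSOR, ACPI_LAYER_PCI) gives a
single argument that selects both components. As the text notes, some
values produce a great deal of output, so a combined mask may need a
larger log_buf_len.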
acpi_no_auto_ssdt [HW,ACPI] Disable automatic loading of SSDT acpi_os_name= [HW,ACPI] Tell ACPI BIOS the name of the OS Format: To spoof as Windows 98: ="Microsoft Windows" acpi_osi= [HW,ACPI] Modify list of supported OS interface strings acpi_osi="string1" # add string1 -- only one string acpi_osi="!string2" # remove built-in string2 acpi_osi= # disable all strings acpi_pm_good [X86] Override the pmtimer bug detection: force the kernel to assume that this machine's pmtimer latches its value and always returns good values. acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode Format: { level | edge | high | low } acpi_serialize [HW,ACPI] force serialization of AML methods acpi_skip_timer_override [HW,ACPI] Recognize and ignore IRQ0/pin2 Interrupt Override. For broken nForce2 BIOS resulting in XT-PIC timer. acpi_sleep= [HW,ACPI] Sleep options Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, old_ordering, nonvs, sci_force_enable } See Documentation/power/video.txt for information on s3_bios and s3_mode. s3_beep is for debugging; it makes the PC's speaker beep as soon as the kernel's real-mode entry point is called. s4_nohwsig prevents ACPI hardware signature from being used during resume from hibernation. old_ordering causes the ACPI 1.0 ordering of the _PTS control method, with respect to putting devices into low power states, to be enforced (the ACPI 2.0 ordering of _PTS is used by default). nonvs prevents the kernel from saving/restoring the ACPI NVS memory during suspend/hibernation and resume. sci_force_enable causes the kernel to set SCI_EN directly on resume from S1/S3 (which is against the ACPI spec, but some broken systems don't work without it). acpi_use_timer_override [HW,ACPI] Use timer override. For some broken Nvidia NF5 boards that require a timer override, but don't have HPET acpi_enforce_resources= [ACPI] { strict | lax | no } Check for resource conflicts between native drivers and ACPI OperationRegions (SystemIO and SystemMemory only). 
                        IO ports and memory declared in ACPI might be
                        used by the ACPI subsystem in arbitrary AML code and
                        can interfere with legacy drivers.
                        strict (default): access to resources claimed by ACPI
                        is denied; legacy drivers trying to access reserved
                        resources will fail to bind to device using them.
                        lax: access to resources claimed by ACPI is allowed;
                        legacy drivers trying to access reserved resources
                        will bind successfully but a warning message is logged.
                        no: ACPI OperationRegions are not marked as reserved,
                        no further checks are performed.

        add_efi_memmap  [EFI; X86] Include EFI memory map in
                        kernel's map of available physical RAM.

        agp=            [AGP]
                        { off | try_unsupported }
                        off: disable AGP support
                        try_unsupported: try to drive unsupported chipsets
                                (may crash computer or cause data corruption)

        ALSA            [HW,ALSA]
                        See Documentation/sound/alsa/alsa-parameters.txt

        alignment=      [KNL,ARM]
                        Allow the default userspace alignment fault handler
                        behaviour to be specified. Bit 0 enables warnings,
                        bit 1 enables fixups, and bit 2 sends a segfault.

        align_va_addr=  [X86-64]
                        Align virtual addresses by clearing slice [14:12] when
                        allocating a VMA at process creation time. This option
                        gives you up to 3% performance improvement on AMD F15h
                        machines (where it is enabled by default) for a
                        CPU-intensive style benchmark, and it can vary highly in
                        a microbenchmark depending on workload and compiler.

                        32: only for 32-bit processes
                        64: only for 64-bit processes
                        on: enable for both 32- and 64-bit processes
                        off: disable for both 32- and 64-bit processes

        amd_iommu=      [HW,X86-64]
                        Pass parameters to the AMD IOMMU driver in the system.
                        Possible values are:
                        fullflush - enable flushing of IO/TLB entries when
                                    they are unmapped. Otherwise they are
                                    flushed before they will be reused, which
                                    is a lot faster
                        off       - do not initialize any AMD IOMMU found in
                                    the system
                        force_isolation - Force device isolation for all
                                          devices. The IOMMU driver is no
                                          longer allowed to lift isolation
                                          requirements as needed.
                                          This option does not override
                                          iommu=pt

        amijoy.map=     [HW,JOY] Amiga joystick support
                        Map of devices attached to JOY0DAT and JOY1DAT
                        Format: ,
                        See also Documentation/input/joystick.txt

        analog.map=     [HW,JOY] Analog joystick and gamepad support
                        Specifies type or capabilities of an analog joystick
                        connected to one of 16 gameports
                        Format: ,,..

        apc=            [HW,SPARC]
                        Power management functions (SPARCstation-4/5 + deriv.)
                        Format: noidle
                        Disable APC CPU standby support. SPARCstation-Fox does
                        not play well with APC CPU idle - disable it if you have
                        APC and your system crashes randomly.

        apic=           [APIC,X86-32] Advanced Programmable Interrupt Controller
                        Change the output verbosity whilst booting
                        Format: { quiet (default) | verbose | debug }
                        Change the amount of debugging information output
                        when initialising the APIC and IO-APIC components.

        autoconf=       [IPV6]
                        See Documentation/networking/ipv6.txt.

        show_lapic=     [APIC,X86] Advanced Programmable Interrupt Controller
                        Limit apic dumping. The parameter defines the maximal
                        number of local apics being dumped. It can also be
                        set to "all", meaning no limit.
                        Format: { 1 (default) | 2 | ... | all }.
                        The parameter is valid only if apic=debug or
                        apic=verbose is specified.
                        Example: apic=debug show_lapic=all

        apm=            [APM] Advanced Power Management
                        See header of arch/x86/kernel/apm_32.c.
arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards Format: ,, ataflop= [HW,M68k] atarimouse= [HW,MOUSE] Atari Mouse atkbd.extra= [HW] Enable extra LEDs and keys on IBM RapidAccess, EzKey and similar keyboards atkbd.reset= [HW] Reset keyboard during initialization atkbd.set= [HW] Select keyboard code set Format: (2 = AT (default), 3 = PS/2) atkbd.scroll= [HW] Enable scroll wheel on MS Office and similar keyboards atkbd.softraw= [HW] Choose between synthetic and real raw mode Format: (0 = real, 1 = synthetic (default)) atkbd.softrepeat= [HW] Use software keyboard repeat autotest [IA-64] baycom_epp= [HW,AX25] Format: , baycom_par= [HW,AX25] BayCom Parallel Port AX.25 Modem Format: , See header of drivers/net/hamradio/baycom_par.c. baycom_ser_fdx= [HW,AX25] BayCom Serial Port AX.25 Modem (Full Duplex Mode) Format: ,,[,] See header of drivers/net/hamradio/baycom_ser_fdx.c. baycom_ser_hdx= [HW,AX25] BayCom Serial Port AX.25 Modem (Half Duplex Mode) Format: ,, See header of drivers/net/hamradio/baycom_ser_hdx.c. boot_delay= Milliseconds to delay each printk during boot. Values larger than 10 seconds (10000) are changed to no delay (0). Format: integer bootmem_debug [KNL] Enable bootmem allocator debug messages. bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards) bttv.radio= Most important insmod options are available as kernel args too. bttv.pll= See Documentation/video4linux/bttv/Insmod-options bttv.tuner= bulk_remove=off [PPC] This parameter disables the use of the pSeries firmware feature for flushing multiple hpte entries at a time. c101= [NET] Moxa C101 synchronous serial card cachesize= [BUGS=X86-32] Override level 2 CPU cache size detection. Sometimes CPU hardware bugs make them report the cache size incorrectly. The kernel will attempt work arounds to fix known problems, but for some CPUs it is not possible to determine what the correct size should be. This option provides an override for these situations. 
capability.disable= [SECURITY] Disable capabilities. This would normally be used only if an alternative security model is to be configured. Potentially dangerous and should only be used if you are entirely sure of the consequences. ccw_timeout_log [S390] See Documentation/s390/CommonIO for details. cgroup_disable= [KNL] Disable a particular controller Format: {name of the controller(s) to disable} {Currently supported controllers - "memory"} checkreqprot [SELINUX] Set initial checkreqprot flag value. Format: { "0" | "1" } See security/selinux/Kconfig help text. 0 -- check protection applied by kernel (includes any implied execute protection). 1 -- check protection requested by application. Default value is set via a kernel config option. Value can be changed at runtime via /selinux/checkreqprot. cio_ignore= [S390] See Documentation/s390/CommonIO for details. clock= [BUGS=X86-32, HW] gettimeofday clocksource override. [Deprecated] Forces specified clocksource (if available) to be used when calculating gettimeofday(). If specified clocksource is not available, it defaults to PIT. Format: { pit | tsc | cyclone | pmtmr } clocksource= Override the default clocksource Format: Override the default clocksource and use the clocksource with the name specified. Some clocksource names to choose from, depending on the platform: [all] jiffies (this is the base, fallback clocksource) [ACPI] acpi_pm [ARM] imx_timer1,OSTS,netx_timer,mpu_timer2, pxa_timer,timer3,32k_counter,timer0_1 [AVR32] avr32 [X86-32] pit,hpet,tsc; scx200_hrt on Geode; cyclone on IBM x440 [MIPS] MIPS [PARISC] cr16 [S390] tod [SH] SuperH [SPARC64] tick [X86-64] hpet,tsc clearcpuid=BITNUM [X86] Disable CPUID feature X for the kernel. See arch/x86/include/asm/cpufeature.h for the valid bit numbers. Note the Linux specific bits are not necessarily stable over kernel options, but the vendor specific ones should be. 
Also note that user programs calling CPUID directly or using the feature without checking anything will still see it. This just prevents it from being used by the kernel or shown in /proc/cpuinfo. Also note the kernel might malfunction if you disable some critical bits. cmo_free_hint= [PPC] Format: { yes | no } Specify whether pages are marked as being inactive when they are freed. This is used in CMO environments to determine OS memory pressure for page stealing by a hypervisor. Default: yes code_bytes [X86] How many bytes of object code to print in an oops report. Range: 0 - 8192 Default: 64 com20020= [HW,NET] ARCnet - COM20020 chipset Format: [,[,[,[,[,]]]]] com90io= [HW,NET] ARCnet - COM90xx chipset (IO-mapped buffers) Format: [,] com90xx= [HW,NET] ARCnet - COM90xx chipset (memory-mapped buffers) Format: [,[,]] condev= [HW,S390] console device conmode= console= [KNL] Output console device and options. tty Use the virtual console device . ttyS[,options] ttyUSB0[,options] Use the specified serial port. The options are of the form "bbbbpnf", where "bbbb" is the baud rate, "p" is parity ("n", "o", or "e"), "n" is number of bits, and "f" is flow control ("r" for RTS or omit it). Default is "9600n8". See Documentation/serial-console.txt for more information. See Documentation/networking/netconsole.txt for an alternative. uart[8250],io,[,options] uart[8250],mmio,[,options] Start an early, polled-mode console on the 8250/16550 UART at the specified I/O port or MMIO address, switching to the matching ttyS device later. The options are the same as for ttyS, above. If the device connected to the port is not a TTY but a braille device, prepend "brl," before the device type, for instance console=brl,ttyS0 For now, only VisioBraille is supported. consoleblank= [KNL] The console blank (screen saver) timeout in seconds. Defaults to 10*60 = 10mins. A value of 0 disables the blank timer. coredump_filter= [KNL] Change the default value for /proc//coredump_filter. 
See also Documentation/filesystems/proc.txt. cpuidle.off=1 [CPU_IDLE] disable the cpuidle sub-system cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver Format: ,,,[,] crashkernel=size[KMG][@offset[KMG]] [KNL] Using kexec, Linux can switch to a 'crash kernel' upon panic. This parameter reserves the physical memory region [offset, offset + size] for that kernel image. If '@offset' is omitted, then a suitable offset is selected automatically. Check Documentation/kdump/kdump.txt for further details. crashkernel=range1:size1[,range2:size2,...][@offset] [KNL] Same as above, but depends on the memory in the running system. The syntax of range is start-[end] where start and end are both a memory unit (amount[KMG]). See also Documentation/kdump/kdump.txt for an example. cs89x0_dma= [HW,NET] Format: cs89x0_media= [HW,NET] Format: { rj45 | aui | bnc } dasd= [HW,NET] See header of drivers/s390/block/dasd_devmap.c. db9.dev[2|3]= [HW,JOY] Multisystem joystick support via parallel port (one device per port) Format: , See also Documentation/input/joystick-parport.txt ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot time. See Documentation/dynamic-debug-howto.txt for details. debug [KNL] Enable kernel debugging (events log level). debug_locks_verbose= [KNL] verbose self-tests Format=<0|1> Print debugging info while doing the locking API self-tests. We default to 0 (no extra messages), setting it to 1 will print _a lot_ more information - normally only useful to kernel developers. debug_objects [KNL] Enable object debugging no_debug_objects [KNL] Disable object debugging debug_guardpage_minorder= [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter allows control of the order of pages that will be intentionally kept free (and hence protected) by the buddy allocator. Bigger value increase the probability of catching random memory corruption, but reduce the amount of memory for normal system use. The maximum possible value is MAX_ORDER/2. 
Setting this parameter to 1 or 2 should be enough to identify most random memory corruption problems caused by bugs in kernel or driver code when a CPU writes to (or reads from) a random memory location. Note that there exists a class of memory corruption problems caused by buggy H/W or F/W or by drivers badly programming DMA (basically when memory is written at bus level and the CPU MMU is bypassed) which are not detectable by CONFIG_DEBUG_PAGEALLOC, hence this option will not help in tracking down these problems. debugpat [X86] Enable PAT debugging decnet.addr= [HW,NET] Format: [,] See also Documentation/networking/decnet.txt. default_hugepagesz= [same as hugepagesz=] The default HugeTLB page size. This is the size represented by the legacy /proc/ hugepages APIs, used for SHM, and default size when mounting hugetlbfs filesystems. Defaults to the architecture's default huge page size if not specified. dhash_entries= [KNL] Set number of hash buckets for dentry cache. digi= [HW,SERIAL] IO parameters + enable/disable command. digiepca= [HW,SERIAL] See drivers/char/README.epca and Documentation/serial/digiepca.txt. disable= [IPV6] See Documentation/networking/ipv6.txt. disable_ddw [PPC/PSERIES] Disable Dynamic DMA Window support. Use this to work around buggy firmware. disable_ipv6= [IPV6] See Documentation/networking/ipv6.txt. disable_mtrr_cleanup [X86] The kernel tries to adjust MTRR layout from continuous to discrete, to make the X server driver able to add a WB entry later. This parameter disables that. disable_mtrr_trim [X86, Intel and AMD only] By default the kernel will trim any uncacheable memory out of your available memory pool based on MTRR settings. This parameter disables that behavior, possibly causing your machine to run very slowly. disable_timer_pin_1 [X86] Disable PIN 1 of APIC timer Can be useful to work around chipset bugs. dma_debug=off If the kernel is compiled with DMA_API_DEBUG support, this option disables the debugging code at boot. 
dma_debug_entries= This option allows to tune the number of preallocated entries for DMA-API debugging code. One entry is required per DMA-API allocation. Use this if the DMA-API debugging code disables itself because the architectural default is too low. dma_debug_driver= With this option the DMA-API debugging driver filter feature can be enabled at boot time. Just pass the driver to filter for as the parameter. The filter can be disabled or changed to another driver later using sysfs. dscc4.setup= [NET] earlycon= [KNL] Output early console device and options. uart[8250],io,[,options] uart[8250],mmio,[,options] uart[8250],mmio32,[,options] Start an early, polled-mode console on the 8250/16550 UART at the specified I/O port or MMIO address. MMIO inter-register address stride is either 8-bit (mmio) or 32-bit (mmio32). The options are the same as for ttyS, above. earlyprintk= [X86,SH,BLACKFIN] earlyprintk=vga earlyprintk=serial[,ttySn[,baudrate]] earlyprintk=ttySn[,baudrate] earlyprintk=dbgp[debugController#] Append ",keep" to not disable it when the real console takes over. Only vga or serial or usb debug port at a time. Currently only ttyS0 and ttyS1 are supported. Interaction with the standard serial driver is not very good. The VGA output is eventually overwritten by the real console. ekgdboc= [X86,KGDB] Allow early kernel console debugging ekgdboc=kbd This is designed to be used in conjunction with the boot argument: earlyprintk=vga edd= [EDD] Format: {"off" | "on" | "skip[mbr]"} eisa_irq_edge= [PARISC,HW] See header of drivers/parisc/eisa.c. elanfreq= [X86-32] See comment before function elanfreq_setup() in arch/x86/kernel/cpu/cpufreq/elanfreq.c. elevator= [IOSCHED] Format: {"cfq" | "deadline" | "noop"} See Documentation/block/cfq-iosched.txt and Documentation/block/deadline-iosched.txt for details. elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390] Specifies physical address of start of kernel core image elf header and optionally the size. 
Generally kexec loader will pass this option to capture kernel. See Documentation/kdump/kdump.txt for details. enable_mtrr_cleanup [X86] The kernel tries to adjust MTRR layout from continuous to discrete, to make X server driver able to add WB entry later. This parameter enables that. enable_timer_pin_1 [X86] Enable PIN 1 of APIC timer Can be useful to work around chipset bugs (in particular on some ATI chipsets). The kernel tries to set a reasonable default. enforcing [SELINUX] Set initial enforcing status. Format: {"0" | "1"} See security/selinux/Kconfig help text. 0 -- permissive (log only, no denials). 1 -- enforcing (deny and log). Default value is 0. Value can be changed at runtime via /selinux/enforce. erst_disable [ACPI] Disable Error Record Serialization Table (ERST) support. ether= [HW,NET] Ethernet cards parameters This option is obsoleted by the "netdev=" option, which has equivalent usage. See its documentation for details. evm= [EVM] Format: { "fix" } Permit 'security.evm' to be updated regardless of current integrity status. failslab= fail_page_alloc= fail_make_request=[KNL] General fault injection mechanism. Format: ,,, See also Documentation/fault-injection/. floppy= [HW] See Documentation/blockdev/floppy.txt. force_pal_cache_flush [IA-64] Avoid check_sal_cache_flush which may hang on buggy SAL_CACHE_FLUSH implementations. Using this parameter will force ia64_sal_cache_flush to call ia64_pal_cache_flush instead of SAL_CACHE_FLUSH. ftrace=[tracer] [FTRACE] will set and start the specified tracer as early as possible in order to facilitate early boot debugging. ftrace_dump_on_oops[=orig_cpu] [FTRACE] will dump the trace buffers on oops. If no parameter is passed, ftrace will dump buffers of all CPUs, but if you pass orig_cpu, it will dump only the buffer of the CPU that triggered the oops. ftrace_filter=[function-list] [FTRACE] Limit the functions traced by the function tracer at boot up. function-list is a comma separated list of functions. 
This list can be changed at run time by the set_ftrace_filter file in the debugfs tracing directory. ftrace_notrace=[function-list] [FTRACE] Do not trace the functions specified in function-list. This list can be changed at run time by the set_ftrace_notrace file in the debugfs tracing directory. ftrace_graph_filter=[function-list] [FTRACE] Limit the top level callers functions traced by the function graph tracer at boot up. function-list is a comma separated list of functions that can be changed at run time by the set_graph_function file in the debugfs tracing directory. gamecon.map[2|3]= [HW,JOY] Multisystem joystick and NES/SNES/PSX pad support via parallel port (up to 5 devices per port) Format: ,,,,, See also Documentation/input/joystick-parport.txt gamma= [HW,DRM] gart_fix_e820= [X86_64] disable the fix e820 for K8 GART Format: off | on default: on gcov_persist= [GCOV] When non-zero (default), profiling data for kernel modules is saved and remains accessible via debugfs, even when the module is unloaded/reloaded. When zero, profiling data is discarded and associated debugfs files are removed at module unload time. gpt [EFI] Forces disk with valid GPT signature but invalid Protective MBR to be treated as GPT. hashdist= [KNL,NUMA] Large hashes allocated during boot are distributed across NUMA nodes. Defaults on for 64-bit NUMA, off otherwise. Format: 0 | 1 (for off | on) hcl= [IA-64] SGI's Hardware Graph compatibility layer hd= [EIDE] (E)IDE hard drive subsystem geometry Format: ,, hest_disable [ACPI] Disable Hardware Error Source Table (HEST) support; corresponding firmware-first mode error processing logic will be disabled. highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact size of . This works even on boxes that have no highmem otherwise. This also works to reduce highmem size on bigger boxes. highres= [KNL] Enable/disable high resolution timer mode. 
Valid parameters: "on", "off" Default: "on" hisax= [HW,ISDN] See Documentation/isdn/README.HiSax. hlt [BUGS=ARM,SH] hpet= [X86-32,HPET] option to control HPET usage Format: { enable (default) | disable | force | verbose } disable: disable HPET and use PIT instead force: allow force enabled of undocumented chips (ICH4, VIA, nVidia) verbose: show contents of HPET registers during setup hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot. hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages. On x86-64 and powerpc, this option can be specified multiple times interleaved with hugepages= to reserve huge pages of different sizes. Valid pages sizes on x86-64 are 2M (when the CPU supports "pse") and 1G (when the CPU supports the "pdpe1gb" cpuinfo flag) Note that 1GB pages can only be allocated at boot time using hugepages= and not freed afterwards. hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) terminal devices. Valid values: 0..8 hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs. If specified, z/VM IUCV HVC accepts connections from listed z/VM user IDs only. keep_bootcon [KNL] Do not unregister boot console at start. This is only useful for debugging when something happens in the window between unregistering the boot console and initializing the real console. i2c_bus= [HW] Override the default board specific I2C bus speed or register an additional I2C bus that is not registered from board initialization code. 
Format: , i8042.debug [HW] Toggle i8042 debug mode i8042.direct [HW] Put keyboard port into non-translated mode i8042.dumbkbd [HW] Pretend that controller can only read data from keyboard and cannot control its state (Don't attempt to blink the LEDs) i8042.noaux [HW] Don't check for auxiliary (== mouse) port i8042.nokbd [HW] Don't check/create keyboard port i8042.noloop [HW] Disable the AUX Loopback command while probing for the AUX port i8042.nomux [HW] Don't check presence of an active multiplexing controller i8042.nopnp [HW] Don't use ACPIPnP / PnPBIOS to discover KBD/AUX controllers i8042.notimeout [HW] Ignore timeout condition signalled by controller i8042.reset [HW] Reset the controller during init and cleanup i8042.unlock [HW] Unlock (ignore) the keylock i810= [HW,DRM] i8k.ignore_dmi [HW] Continue probing hardware even if DMI data indicates that the driver is running on unsupported hardware. i8k.force [HW] Activate i8k driver even if SMM BIOS signature does not match list of supported models. i8k.power_status [HW] Report power status in /proc/i8k (disabled by default) i8k.restricted [HW] Allow controlling fans only if SYS_ADMIN capability is set. icn= [HW,ISDN] Format: [,[,[,]]] ide-core.nodma= [HW] (E)IDE subsystem Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc .vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr .cdrom .chs .ignore_cable are additional options See Documentation/ide/ide.txt. ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem Claim all unknown PCI IDE storage controllers. idle= [X86] Format: idle=poll, idle=mwait, idle=halt, idle=nomwait Poll forces a polling idle loop that can slightly improve the performance of waking up an idle CPU, but will use a lot of power and make the system run hot. Not recommended. idle=mwait: On systems which support MONITOR/MWAIT but the kernel chose to not use it because it doesn't save as much power as a normal idle loop, use the MONITOR/MWAIT idle loop anyway. 
Performance should be the same as idle=poll. idle=halt: Halt is forced to be used for CPU idle. In such case C2/C3 won't be used again. idle=nomwait: Disable mwait for CPU C-states ignore_loglevel [KNL] Ignore loglevel setting - this will print /all/ kernel messages to the console. Useful for debugging. We also add it as printk module parameter, so users could change it dynamically, usually by /sys/module/printk/parameters/ignore_loglevel. ihash_entries= [KNL] Set number of hash buckets for inode cache. ima_audit= [IMA] Format: { "0" | "1" } 0 -- integrity auditing messages. (Default) 1 -- enable informational integrity auditing messages. ima_hash= [IMA] Format: { "sha1" | "md5" } default: "sha1" ima_tcb [IMA] Load a policy which meets the needs of the Trusted Computing Base. This means IMA will measure all programs exec'd, files mmap'd for exec, and all files opened for read by uid=0. init= [KNL] Format: Run specified binary instead of /sbin/init as init process. initcall_debug [KNL] Trace initcalls as they are executed. Useful for working out where the kernel is dying during startup. initrd= [BOOT] Specify the location of the initial ramdisk inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver Format: intel_iommu= [DMAR] Intel IOMMU driver (DMAR) option on Enable intel iommu driver. off Disable intel iommu driver. igfx_off [Default Off] By default, gfx is mapped as normal device. If a gfx device has a dedicated DMAR unit, the DMAR unit is bypassed by not enabling DMAR with this option. In this case, gfx device will use physical address for DMA. forcedac [x86_64] With this option iommu will not optimize to look for io virtual address below 32-bit forcing dual address cycle on pci bus for cards supporting greater than 32-bit addressing. The default is to look for translation below 32-bit and if not available then look in the higher range. 
strict [Default Off] With this option on every unmap_single operation will result in a hardware IOTLB flush operation as opposed to batching them for performance. sp_off [Default Off] By default, super page will be supported if Intel IOMMU has the capability. With this option, super page will not be supported. intel_idle.max_cstate= [KNL,HW,ACPI,X86] 0 disables intel_idle and fall back on acpi_idle. 1 to 6 specify maximum depth of C-state. intremap= [X86-64, Intel-IOMMU] on enable Interrupt Remapping (default) off disable Interrupt Remapping nosid disable Source ID checking no_x2apic_optout BIOS x2APIC opt-out request will be ignored inttest= [IA-64] iomem= Disable strict checking of access to MMIO memory strict regions from userspace. relaxed iommu= [x86] off force noforce biomerge panic nopanic merge nomerge forcesac soft pt [x86, IA-64] group_mf [x86, IA-64] io7= [HW] IO7 for Marvel based alpha systems See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c. io_delay= [X86] I/O delay method 0x80 Standard port 0x80 based delay 0xed Alternate port 0xed based delay (needed on some systems) udelay Simple two microseconds delay none No delay ip= [IP_PNP] See Documentation/filesystems/nfs/nfsroot.txt. ip2= [HW] Set IO/IRQ pairs for up to 4 IntelliPort boards See comment before ip2_setup() in drivers/char/ip2/ip2base.c. irqfixup [HW] When an interrupt is not handled search all handlers for it. Intended to get systems with badly broken firmware running. irqpoll [HW] When an interrupt is not handled search all handlers for it. Also check all handlers each timer interrupt. Intended to get systems with badly broken firmware running. isapnp= [ISAPNP] Format: ,,, isolcpus= [KNL,SMP] Isolate CPUs from the general scheduler. Format: ,..., or - (must be a positive range in ascending order) or a mixture ,...,- This option can be used to specify one or more CPUs to isolate from the general SMP balancing and scheduling algorithms. 
You can move a process onto or off an "isolated" CPU via the CPU affinity syscalls or cpuset. begins at 0 and the maximum value is "number of CPUs in system - 1". This option is the preferred way to isolate CPUs. The alternative -- manually setting the CPU mask of all tasks in the system -- can cause problems and suboptimal load balancer performance. iucv= [HW,NET] js= [HW,JOY] Analog joystick See Documentation/input/joystick.txt. keepinitrd [HW,ARM] kernelcore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter specifies the amount of memory usable by the kernel for non-movable allocations. The requested amount is spread evenly throughout all nodes in the system. The remaining memory in each node is used for Movable pages. In the event that a node is too small to have both kernelcore and Movable pages, kernelcore pages will take priority and other nodes will have a larger number of kernelcore pages. The Movable zone is used for the allocation of pages that may be reclaimed or moved by the page migration subsystem. This means that HugeTLB pages may not be allocated from this zone. Note that allocations like PTEs-from-HighMem still use the HighMem zone if it exists, and the Normal zone if it does not. kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port. Format: [,poll interval] The controller # is the number of the ehci usb debug port as it is probed via PCI. The poll interval is optional and is the number of seconds between each poll cycle to the debug port in case you need the functionality for interrupting the kernel with gdb or control-c on the dbgp connection. When not using this parameter you use sysrq-g to break into the kernel debugger. kgdboc= [KGDB,HW] kgdb over consoles. Requires a tty driver that supports console polling, or a supported polling keyboard driver (non-usb). 
Serial only format: [,baud] keyboard only format: kbd keyboard and serial format: kbd,[,baud] Optional Kernel mode setting: kms, kbd format: kms,kbd kms, kbd and serial format: kms,kbd,[,baud] kgdbwait [KGDB] Stop kernel execution and enter the kernel debugger at the earliest opportunity. kmac= [MIPS] korina ethernet MAC address. Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address. kmemleak= [KNL] Boot-time kmemleak enable/disable Valid arguments: on, off Default: on kstack=N [X86] Print N words from the kernel stack in oops dumps. kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs. Default is 0 (don't ignore, but inject #GP) kvm.mmu_audit= [KVM] This is a R/W parameter which allows audit KVM MMU at runtime. Default is 0 (off) kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM. Default is 1 (enabled) kvm-amd.npt= [KVM,AMD] Disable nested paging (virtualized MMU) for all guests. Default is 1 (enabled) if in 64-bit or 32-bit PAE mode. kvm-intel.ept= [KVM,Intel] Disable extended page tables (virtualized MMU) support on capable Intel chips. Default is 1 (enabled) kvm-intel.emulate_invalid_guest_state= [KVM,Intel] Enable emulation of invalid guest states Default is 0 (disabled) kvm-intel.flexpriority= [KVM,Intel] Disable FlexPriority feature (TPR shadow). Default is 1 (enabled) kvm-intel.nested= [KVM,Intel] Enable VMX nesting (nVMX). Default is 0 (disabled) kvm-intel.unrestricted_guest= [KVM,Intel] Disable unrestricted guest feature (virtualized real and unpaged mode) on capable Intel chips. Default is 1 (enabled) kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification feature (tagged TLBs) on capable Intel chips. Default is 1 (enabled) l2cr= [PPC] l3cr= [PPC] lapic [X86-32,APIC] Enable the local APIC even if BIOS disabled it. lapic_timer_c2_ok [X86,APIC] trust the local apic timer in C2 power state. 
libata.dma= [LIBATA] DMA control libata.dma=0 Disable all PATA and SATA DMA libata.dma=1 PATA and SATA Disk DMA only libata.dma=2 ATAPI (CDROM) DMA only libata.dma=4 Compact Flash DMA only Combinations also work, so libata.dma=3 enables DMA for disks and CDROMs, but not CFs. libata.ignore_hpa= [LIBATA] Ignore HPA limit libata.ignore_hpa=0 keep BIOS limits (default) libata.ignore_hpa=1 ignore limits, using full disk libata.noacpi [LIBATA] Disables use of ACPI in libata suspend/resume when set. Format: libata.force= [LIBATA] Force configurations. The format is comma separated list of "[ID:]VAL" where ID is PORT[.DEVICE]. PORT and DEVICE are decimal numbers matching port, link or device. Basically, it matches the ATA ID string printed on console by libata. If the whole ID part is omitted, the last PORT and DEVICE values are used. If ID hasn't been specified yet, the configuration applies to all ports, links and devices. If only DEVICE is omitted, the parameter applies to the port and all links and devices behind it. DEVICE number of 0 either selects the first device or the first fan-out link behind PMP device. It does not select the host link. DEVICE number of 15 selects the host link and device attached to it. The VAL specifies the configuration to force. As long as there's no ambiguity shortcut notation is allowed. For example, both 1.5 and 1.5G would work for 1.5Gbps. The following configurations can be forced. * Cable type: 40c, 80c, short40c, unk, ign or sata. Any ID with matching PORT is used. * SATA link speed limit: 1.5Gbps or 3.0Gbps. * Transfer mode: pio[0-7], mwdma[0-4] and udma[0-7]. udma[/][16,25,33,44,66,100,133] notation is also allowed. * [no]ncq: Turn on or off NCQ. * nohrst, nosrst, norst: suppress hard, soft and both resets. * dump_id: dump IDENTIFY data. If there are multiple matching configurations changing the same attribute, the last one is used. memblock=debug [KNL] Enable memblock debug messages. 
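The libata.dma= values listed above (1, 2 and 4, with combinations allowed) behave like a bitmask. The sketch below, in Python, decodes such a value into the device classes it enables. It only illustrates the arithmetic described in the text; the class, member and function names are hypothetical and do not come from the libata driver.

```python
from enum import IntFlag


class LibataDma(IntFlag):
    """Bit values listed for libata.dma= (names are illustrative)."""
    DISK = 1   # PATA and SATA disk DMA
    ATAPI = 2  # ATAPI (CDROM) DMA
    CF = 4     # Compact Flash DMA


def enabled_dma_classes(value: int) -> list[str]:
    """List the device classes for which DMA is enabled by libata.dma=<value>."""
    return [flag.name for flag in LibataDma if value & flag]
```

With this decoding, libata.dma=3 gives DISK and ATAPI but not CF, matching the example in the text, and libata.dma=0 gives an empty list (all DMA disabled).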
load_ramdisk= [RAM] List of ramdisks to load from floppy See Documentation/blockdev/ramdisk.txt. lockd.nlm_grace_period=P [NFS] Assign grace period. Format: lockd.nlm_tcpport=N [NFS] Assign TCP port. Format: lockd.nlm_timeout=T [NFS] Assign timeout value. Format: lockd.nlm_udpport=M [NFS] Assign UDP port. Format: logibm.irq= [HW,MOUSE] Logitech Bus Mouse Driver Format: loglevel= All Kernel Messages with a loglevel smaller than the console loglevel will be printed to the console. It can also be changed with klogd or other programs. The loglevels are defined as follows: 0 (KERN_EMERG) system is unusable 1 (KERN_ALERT) action must be taken immediately 2 (KERN_CRIT) critical conditions 3 (KERN_ERR) error conditions 4 (KERN_WARNING) warning conditions 5 (KERN_NOTICE) normal but significant condition 6 (KERN_INFO) informational 7 (KERN_DEBUG) debug-level messages log_buf_len=n[KMG] Sets the size of the printk ring buffer, in bytes. n must be a power of two. The default size is set in the kernel config file. logo.nologo [FB] Disables display of the built-in Linux logo. This may be used to provide more screen space for kernel log messages and is useful when debugging kernel boot problems. lp=0 [LP] Specify parallel ports to use, e.g, lp=port[,port...] lp=none,parport0 (lp0 not configured, lp1 uses lp=reset first parallel port). 'lp=0' disables the lp=auto printer driver. 'lp=reset' (which can be specified in addition to the ports) causes attached printers to be reset. Using lp=port1,port2,... specifies the parallel ports to associate lp devices with, starting with lp0. A port specification may be 'none' to skip that lp device, or a parport name such as 'parport0'. Specifying 'lp=auto' instead of a port specification list means that device IDs from each port should be examined, to see if an IEEE 1284-compliant printer is attached; if so, the driver will manage that printer. See also header of drivers/char/lp.c. 
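The loglevel= rule above (a message reaches the console only if its loglevel is smaller than the console loglevel) can be sketched in Python as follows. The level names are taken from the table above; the function name is hypothetical and the code is an illustration of the comparison, not of printk itself.

```python
# Level numbers and names as listed for loglevel= above.
LOGLEVELS = {
    0: "KERN_EMERG",
    1: "KERN_ALERT",
    2: "KERN_CRIT",
    3: "KERN_ERR",
    4: "KERN_WARNING",
    5: "KERN_NOTICE",
    6: "KERN_INFO",
    7: "KERN_DEBUG",
}


def printed_levels(console_loglevel: int) -> list[str]:
    """Names of the message levels shown on the console for loglevel=<n>."""
    # A message is printed only if its level is strictly smaller than the
    # console loglevel.
    return [name for level, name in sorted(LOGLEVELS.items())
            if level < console_loglevel]
```

Under this rule, loglevel=4 shows KERN_EMERG through KERN_ERR but not KERN_WARNING, and KERN_DEBUG messages appear only with loglevel=8 or higher.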
lpj=n [KNL] Sets loops_per_jiffy to given constant, thus avoiding time-consuming boot-time autodetection (up to 250 ms per CPU). 0 enables autodetection (default). To determine the correct value for your kernel, boot with normal autodetection and see what value is printed. Note that on SMP systems the preset will be applied to all CPUs, which is likely to cause problems if your CPUs need significantly divergent settings. An incorrect value will cause delays in the kernel to be wrong, leading to unpredictable I/O errors and other breakage. Although unlikely, in the extreme case this might damage your hardware. ltpc= [NET] Format: ,, machvec= [IA-64] Force the use of a particular machine-vector (machvec) in a generic kernel. Example: machvec=hpzx1_swiotlb machtype= [Loongson] Share the same kernel image file between different yeeloong laptop. Example: machtype=lemote-yeeloong-2f-7inch max_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory greater than or equal to this physical address is ignored. maxcpus= [SMP] Maximum number of processors that an SMP kernel should make use of. maxcpus=n : n >= 0 limits the kernel to using 'n' processors. n=0 is a special case, it is equivalent to "nosmp", which also disables the IO APIC. max_loop= [LOOP] The number of loop block devices that get (loop.max_loop) unconditionally pre-created at init time. The default number is configured by BLK_DEV_LOOP_MIN_COUNT. Instead of statically allocating a predefined number, loop devices can be requested on-demand with the /dev/loop-control interface. mcatest= [IA-64] mce [X86-32] Machine Check Exception mce=option [X86-64] See Documentation/x86/x86_64/boot-options.txt md= [HW] RAID subsystems devices and level See Documentation/md.txt. mdacon= [MDA] Format: , Specifies range of consoles to be captured by the MDA. mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory Amount of memory to be used when the kernel is not able to see the whole system memory or for test. 
[X86-32] Use together with memmap= to avoid physical address space collisions. Without memmap= PCI devices could be placed at addresses belonging to unused RAM. mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel memory. memchunk=nn[KMG] [KNL,SH] Allow user to override the default size for per-device physically contiguous DMA buffers. memmap=exactmap [KNL,X86] Enable setting of an exact E820 memory map, as specified by the user. Such memmap=exactmap lines can be constructed based on BIOS output or other requirements. See the memmap=nn@ss option description. memmap=nn[KMG]@ss[KMG] [KNL] Force usage of a specific region of memory Region of memory to be used, from ss to ss+nn. memmap=nn[KMG]#ss[KMG] [KNL,ACPI] Mark specific memory as ACPI data. Region of memory to be used, from ss to ss+nn. memmap=nn[KMG]$ss[KMG] [KNL,ACPI] Mark specific memory as reserved. Region of memory to be used, from ss to ss+nn. Example: Exclude memory from 0x18690000-0x1869ffff memmap=64K$0x18690000 or memmap=0x10000$0x18690000 memory_corruption_check=0/1 [X86] Some BIOSes seem to corrupt the first 64k of memory when doing things like suspend/resume. Setting this option will scan the memory looking for corruption. Enabling this will both detect corruption and prevent the kernel from using the memory being corrupted. However, it is intended as a diagnostic tool; if repeatable BIOS-originated corruption always affects the same memory, you can use memmap= to prevent the kernel from using that memory. memory_corruption_check_size=size [X86] By default it checks for corruption in the low 64k, making this memory unavailable for normal use. Use this parameter to scan for corruption in more or less memory. memory_corruption_check_period=seconds [X86] By default it checks for corruption every 60 seconds. Use this parameter to check at some other rate. 0 disables periodic checking. memtest= [KNL,X86] Enable memtest Format: default : 0 Specifies the number of memtest passes to be performed. 
Each pass selects another test pattern from a given set of patterns. Memtest fills the memory with this pattern, validates memory contents and reserves bad memory regions that are detected. meye.*= [HW] Set MotionEye Camera parameters See Documentation/video4linux/meye.txt. mfgpt_irq= [IA-32] Specify the IRQ to use for the Multi-Function General Purpose Timers on AMD Geode platforms. mfgptfix [X86-32] Fix MFGPT timers on AMD Geode platforms when the BIOS has incorrectly applied a workaround. TinyBIOS version 0.98 is known to be affected, 0.99 fixes the problem by letting the user disable the workaround. mga= [HW,DRM] min_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory below this physical address is ignored. mini2440= [ARM,HW,KNL] Format:[0..2][b][c][t] Default: "0tb" MINI2440 configuration specification: 0 - The attached screen is the 3.5" TFT 1 - The attached screen is the 7" TFT 2 - The VGA Shield is attached (1024x768) Leaving out the screen size parameter will not load the TFT driver, and the framebuffer will be left unconfigured. b - Enable backlight. The TFT backlight pin will be linked to the kernel VESA blanking code and a GPIO LED. This parameter is not necessary when using the VGA shield. c - Enable the s3c camera interface. t - Reserved for enabling touchscreen support. The touchscreen support is not enabled in the mainstream kernel as of 2.6.30, a preliminary port can be found in the "bleeding edge" mini2440 support kernel at http://repo.or.cz/w/linux-2.6/mini2440.git mminit_loglevel= [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this parameter allows control of the logging verbosity for the additional memory initialisation checks. A value of 0 disables mminit logging and a level of 4 will log everything. Information is printed at KERN_DEBUG so loglevel=8 may also need to be specified. 
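The nn[KMG] size notation used above by memmap=, memchunk= and min_addr= can be illustrated with a short sketch. This is not the kernel's own parser; parse_size() and parse_memmap() are hypothetical helper names, and the sketch only accepts decimal or 0x-prefixed hexadecimal numbers with an optional K, M or G suffix. It reproduces the memmap= example, where 64K and 0x10000 describe the same reserved region.

```python
import re

# Multipliers for the K, M and G suffixes used in nn[KMG] values.
_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

# Separator characters of the three memmap=nn<sep>ss forms described above.
_KINDS = {"@": "usable", "#": "ACPI data", "$": "reserved"}


def parse_size(text):
    """Convert a value such as '64K' or '0x10000' to a number of bytes."""
    match = re.fullmatch(r"(0[xX][0-9a-fA-F]+|[0-9]+)([KMGkmg]?)", text)
    if match is None:
        raise ValueError(f"not a size: {text!r}")
    number, suffix = match.groups()
    base = 16 if number.lower().startswith("0x") else 10
    return int(number, base) * _SUFFIX[suffix.upper()]


def parse_memmap(value):
    """Split 'nn[KMG]<sep>ss[KMG]' into (kind, first_byte, last_byte)."""
    for sep, kind in _KINDS.items():
        if sep in value:
            size, start = value.split(sep, 1)
            first = parse_size(start)
            return kind, first, first + parse_size(size) - 1
    raise ValueError(f"no '@', '#' or '$' separator in {value!r}")


if __name__ == "__main__":
    kind, first, last = parse_memmap("64K$0x18690000")
    print(f"{kind}: {first:#x}-{last:#x}")
```

The last byte is computed as ss+nn-1 so that the printed range matches the inclusive form used in the example (0x18690000-0x1869ffff).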
mousedev.tap_time= [MOUSE] Maximum time between finger touching and leaving touchpad surface for touch to be considered a tap and be reported as a left button click (for touchpads working in absolute mode only). Format: mousedev.xres= [MOUSE] Horizontal screen resolution, used for devices reporting absolute coordinates, such as tablets. mousedev.yres= [MOUSE] Vertical screen resolution, used for devices reporting absolute coordinates, such as tablets. movablecore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter is similar to kernelcore except it specifies the amount of memory used for migratable allocations. If both kernelcore and movablecore are specified, then kernelcore will be at *least* the specified value but may be more. If movablecore on its own is specified, the administrator must be careful that the amount of memory usable for all allocations is not too small. MTD_Partition= [MTD] Format: ,,, MTD_Region= [MTD] Format: ,[,,,,] mtdparts= [MTD] See drivers/mtd/cmdlinepart.c. multitce=off [PPC] This parameter disables the use of the pSeries firmware feature for updating multiple TCE entries at a time. onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock] boundary - index of last SLC block on Flex-OneNAND. The remaining blocks are configured as MLC blocks. lock - Configure if Flex-OneNAND boundary should be locked. Once locked, the boundary cannot be changed. 1 indicates lock status, 0 indicates unlock status. mtdset= [ARM] ARM/S3C2412 JIVE boot control See arch/arm/mach-s3c2412/mach-jive.c mtouchusb.raw_coordinates= [HW] Make the MicroTouch USB driver use raw coordinates ('y', default) or cooked coordinates ('n') mtrr_chunk_size=nn[KMG] [X86] Used for mtrr cleanup. It is the largest continuous chunk that could hold holes (aka UC entries). mtrr_gran_size=nn[KMG] [X86] Used for mtrr cleanup. It is the granularity of an mtrr block. Default is 1. A large value could prevent small alignment from using up MTRRs.
mtrr_spare_reg_nr=n [X86] Format: Range: 0,7 : spare reg number Default : 1 Used for mtrr cleanup. It is spare mtrr entries number. Set to 2 or more if your graphical card needs more. n2= [NET] SDL Inc. RISCom/N2 synchronous serial card netdev= [NET] Network devices parameters Format: ,,,, Note that mem_start is often overloaded to mean something different and driver-specific. This usage is only documented in each driver source file if at all. nf_conntrack.acct= [NETFILTER] Enable connection tracking flow accounting 0 to disable accounting 1 to enable accounting Default value is 0. nfsaddrs= [NFS] Deprecated. Use ip= instead. See Documentation/filesystems/nfs/nfsroot.txt. nfsroot= [NFS] nfs root filesystem for disk-less boxes. See Documentation/filesystems/nfs/nfsroot.txt. nfsrootdebug [NFS] enable nfsroot debugging messages. See Documentation/filesystems/nfs/nfsroot.txt. nfs.callback_tcpport= [NFS] set the TCP port on which the NFSv4 callback channel should listen. nfs.cache_getent= [NFS] sets the pathname to the program which is used to update the NFS client cache entries. nfs.cache_getent_timeout= [NFS] sets the timeout after which an attempt to update a cache entry is deemed to have failed. nfs.idmap_cache_timeout= [NFS] set the maximum lifetime for idmapper cache entries. nfs.enable_ino64= [NFS] enable 64-bit inode numbers. If zero, the NFS client will fake up a 32-bit inode number for the readdir() and stat() syscalls instead of returning the full 64-bit number. The default is to return 64-bit inode numbers. nfs.nfs4_disable_idmapping= [NFSv4] When set to the default of '1', this option ensures that both the RPC level authentication scheme and the NFS level operations agree to use numeric uids/gids if the mount is using the 'sec=sys' security flavour. In effect it is disabling idmapping, which can make migration from legacy NFSv2/v3 systems to NFSv4 easier. 
Servers that do not support this mode of operation will be autodetected by the client, and it will fall back to using the idmapper. To turn off this behaviour, set the value to '0'. nmi_debug= [KNL,AVR32,SH] Specify one or more actions to take when a NMI is triggered. Format: [state][,regs][,debounce][,die] nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels Format: [panic,][nopanic,][num] Valid num: 0 0 - turn nmi_watchdog off When panic is specified, panic when an NMI watchdog timeout occurs (or 'nopanic' to override the opposite default). This is useful when you use a panic=... timeout and need the box quickly up again. netpoll.carrier_timeout= [NET] Specifies amount of time (in seconds) that netpoll should wait for a carrier. By default netpoll waits 4 seconds. no387 [BUGS=X86-32] Tells the kernel to use the 387 maths emulation library even if a 387 maths coprocessor is present. no_console_suspend [HW] Never suspend the console Disable suspending of consoles during suspend and hibernate operations. Once disabled, debugging messages can reach various consoles while the rest of the system is being put to sleep (ie, while debugging driver suspend/resume hooks). This may not work reliably with all consoles, but is known to work with serial and VGA consoles. To facilitate more flexible debugging, we also add console_suspend, a printk module parameter to control it. Users could use console_suspend (usually /sys/module/printk/parameters/console_suspend) to turn on/off it dynamically. noaliencache [MM, NUMA, SLAB] Disables the allocation of alien caches in the slab allocator. Saves per-node memory, but will impact performance. noalign [KNL,ARM] noapic [SMP,APIC] Tells the kernel to not make use of any IOAPICs that may be present in the system. noautogroup Disable scheduler automatic task group creation. nobats [PPC] Do not use BATs for mapping kernel lowmem on "Classic" PPC cores. 
nocache [ARM] noclflush [BUGS=X86] Don't use the CLFLUSH instruction nodelayacct [KNL] Disable per-task delay accounting nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. nodsp [SH] Disable hardware DSP at boot time. noefi [X86] Disable EFI runtime services support. noexec [IA-64] noexec [X86] On X86-32 available only on PAE configured kernels. noexec=on: enable non-executable mappings (default) noexec=off: disable non-executable mappings nosmep [X86] Disable SMEP (Supervisor Mode Execution Protection) even if it is supported by processor. noexec32 [X86-64] This affects only 32-bit executables. noexec32=on: enable non-executable mappings (default) read doesn't imply executable mappings noexec32=off: disable non-executable mappings read implies executable mappings nofpu [SH] Disable hardware FPU at boot time. nofxsr [BUGS=X86-32] Disables x86 floating point extended register save and restore. The kernel will only save legacy floating-point registers on task switch. noxsave [BUGS=X86] Disables x86 extended register state save and restore using xsave. The kernel will fallback to enabling legacy floating-point and sse state. nohlt [BUGS=ARM,SH] Tells the kernel that the sleep(SH) or wfi(ARM) instruction doesn't work correctly and not to use it. This is also useful when using JTAG debugger. no-hlt [BUGS=X86-32] Tells the kernel that the hlt instruction doesn't work correctly and not to use it. no_file_caps Tells the kernel not to honor file capabilities. The only way then for a file to be executed with privilege is to be setuid root or executed by root. nohalt [IA-64] Tells the kernel not to use the power saving function PAL_HALT_LIGHT when idle. This increases power-consumption. On the positive side, it reduces interrupt wake-up latency, which may improve performance in certain environments such as networked servers or real-time systems. 
nohz= [KNL] Boottime enable/disable dynamic ticks Valid arguments: on, off Default: on noiotrap [SH] Disables trapped I/O port accesses. noirqdebug [X86-32] Disables the code which attempts to detect and disable unhandled interrupt sources. no_timer_check [X86,APIC] Disables the code which tests for broken timer IRQ sources. noisapnp [ISAPNP] Disables ISA PnP code. noinitrd [RAM] Tells the kernel not to load any configured initial RAM disk. nointremap [X86-64, Intel-IOMMU] Do not enable interrupt remapping. [Deprecated - use intremap=off] nointroute [IA-64] nojitter [IA-64] Disables jitter checking for ITC timers. no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page fault handling. no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting. steal time is computed, but won't influence scheduler behaviour nolapic [X86-32,APIC] Do not enable or use the local APIC. nolapic_timer [X86-32,APIC] Do not use the local APIC timer. noltlbs [PPC] Do not use large page/tlb entries for kernel lowmem mapping on PPC40x. nomca [IA-64] Disable machine check abort handling nomce [X86-32] Machine Check Exception nomfgpt [X86-32] Disable Multi-Function General Purpose Timer usage (for AMD Geode machines). nonmi_ipi [X86] Disable using NMI IPIs during panic/reboot to shutdown the other cpus. Instead use the REBOOT_VECTOR irq. nopat [X86] Disable PAT (page attribute table extension of pagetables) support. norandmaps Don't use address space randomization. Equivalent to echo 0 > /proc/sys/kernel/randomize_va_space noreplace-paravirt [X86,IA-64,PV_OPS] Don't patch paravirt_ops noreplace-smp [X86-32,SMP] Don't replace SMP instructions with UP alternatives noresidual [PPC] Don't use residual data on PReP machines. nordrand [X86] Disable the direct use of the RDRAND instruction even if it is supported by the processor. RDRAND is still available to user space applications. 
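Many of the entries above are bare flags (noapic, nolapic, norandmaps) while others take a value (nohz=off). As a rough illustration of that distinction, the sketch below splits a command line string into names and values. parse_cmdline() is a hypothetical helper, not kernel code; it assumes that parameters are separated by whitespace and that double quotes protect spaces inside a value, and it does not attempt to reproduce every detail of the kernel's own parsing.

```python
def parse_cmdline(cmdline):
    """Return {name: value}; bare flags such as 'noapic' map to None."""
    params = {}
    token = []
    in_quotes = False
    # A trailing space flushes the final token through the same code path.
    for ch in cmdline.strip() + " ":
        if ch == '"':
            in_quotes = not in_quotes
        elif ch.isspace() and not in_quotes:
            if token:
                name, sep, value = "".join(token).partition("=")
                # If a name appears twice, the later occurrence wins here.
                params[name] = value if sep else None
                token = []
        else:
            token.append(ch)
    return params


if __name__ == "__main__":
    sample = "ro root=/dev/sda1 noapic nohz=off norandmaps"
    for name, value in parse_cmdline(sample).items():
        print(name, "(flag)" if value is None else value)
```

On a running Linux system the same function could be fed the contents of /proc/cmdline to see which of these parameters were actually passed at boot.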
noresume [SWSUSP] Disables resume and restores original swap space. no-scroll [VGA] Disables scrollback. This is required for the Braillex ib80-piezo Braille reader made by F.H. Papenmeier (Germany). nosbagart [IA-64] nosep [BUGS=X86-32] Disables x86 SYSENTER/SYSEXIT support. nosmp [SMP] Tells an SMP kernel to act as a UP kernel, and disable the IO APIC. Legacy for "maxcpus=0". nosoftlockup [KNL] Disable the soft-lockup detector. nosync [HW,M68K] Disables sync negotiation for all devices. notsc [BUGS=X86-32] Disable Time Stamp Counter nousb [USB] Disable the USB subsystem nowatchdog [KNL] Disable the lockup detector (NMI watchdog). nowb [ARM] nox2apic [X86-64,APIC] Do not enable x2APIC mode. nptcg= [IA-64] Override max number of concurrent global TLB purges which is reported from either PAL_VM_SUMMARY or SAL PALO. nr_cpus= [SMP] Maximum number of processors that an SMP kernel could support. nr_cpus=n : n >= 1 limits the kernel to supporting 'n' processors. Later, at runtime, you cannot use the CPU hotplug feature to bring more CPUs online, just as if you had compiled the kernel with NR_CPUS=n. nr_uarts= [SERIAL] Maximum number of UARTs to be registered. numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA. One of ['zone', 'node', 'default'] can be specified. This can be set from sysctl after boot. See Documentation/sysctl/vm.txt for details. ohci1394_dma=early [HW] Enable debugging via the ohci1394 driver. See Documentation/debugging-via-ohci1394.txt for more info. olpc_ec_timeout= [OLPC] ms delay when issuing EC commands. Rather than timing out after 20 ms if an EC command is not properly ACKed, override the length of the timeout. We have interrupts disabled while waiting for the ACK, so if this is set too high interrupts *may* be lost! omap_mux= [OMAP] Override bootloader pin multiplexing. Format: ...
For example, to override I2C bus2: omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100 oprofile.timer= [HW] Use timer interrupt instead of performance counters oprofile.cpu_type= Force an oprofile cpu type This might be useful if you have an older oprofile userland or if you want common events. Format: { arch_perfmon } arch_perfmon: [X86] Force use of architectural perfmon on Intel CPUs instead of the CPU specific event set. timer: [X86] Force use of architectural NMI timer mode (see also oprofile.timer for generic hr timer mode) [s390] Force legacy basic mode sampling (report cpu_type "timer") oops=panic Always panic on oopses. Default is to just kill the process, but there is a small probability of deadlocking the machine. This will also cause panics on machine check exceptions. Useful together with panic=30 to trigger a reboot. OSS [HW,OSS] See Documentation/sound/oss/oss-parameters.txt panic= [KNL] Kernel behaviour on panic: delay timeout > 0: seconds before rebooting timeout = 0: wait forever timeout < 0: reboot immediately Format: parkbd.port= [HW] Parallel port number the keyboard adapter is connected to, default is 0. Format: parkbd.mode= [HW] Parallel port keyboard adapter mode of operation, 0 for XT, 1 for AT (default is AT). Format: parport= [HW,PPT] Specify parallel ports. 0 disables. Format: { 0 | auto | 0xBBB[,IRQ[,DMA]] } Use 'auto' to force the driver to use any IRQ/DMA settings detected (the default is to ignore detected IRQ/DMA settings because of possible conflicts). You can specify the base address, IRQ, and DMA settings; IRQ and DMA should be numbers, or 'auto' (for using detected settings on that particular port), or 'nofifo' (to avoid using a FIFO even if it is detected). Parallel ports are assigned in the order they are specified on the command line, starting with parport0. parport_init_mode= [HW,PPT] Configure VIA parallel port to operate in a specific mode. 
This is necessary on Pegasos computer where firmware has no options for setting up parallel port mode and sets it to spp. Currently this function knows 686a and 8231 chips. Format: [spp|ps2|epp|ecp|ecpepp] pause_on_oops= Halt all CPUs after the first oops has been printed for the specified number of seconds. This is to be used if your oopses keep scrolling off the screen. pcbit= [HW,ISDN] pcd. [PARIDE] See header of drivers/block/paride/pcd.c. See also Documentation/blockdev/paride.txt. pci=option[,option...] [PCI] various PCI subsystem options: earlydump [X86] dump PCI config space before the kernel changes anything off [X86] don't probe for the PCI bus bios [X86-32] force use of PCI BIOS, don't access the hardware directly. Use this if your machine has a non-standard PCI host bridge. nobios [X86-32] disallow use of PCI BIOS, only direct hardware access methods are allowed. Use this if you experience crashes upon bootup and you suspect they are caused by the BIOS. conf1 [X86] Force use of PCI Configuration Mechanism 1. conf2 [X86] Force use of PCI Configuration Mechanism 2. noaer [PCIE] If the PCIEAER kernel config parameter is enabled, this kernel boot option can be used to disable the use of PCIE advanced error reporting. nodomains [PCI] Disable support for multiple PCI root domains (aka PCI segments, in ACPI-speak). nommconf [X86] Disable use of MMCONFIG for PCI Configuration check_enable_amd_mmconf [X86] check for and enable properly configured MMIO access to PCI config space on AMD family 10h CPU nomsi [MSI] If the PCI_MSI kernel config parameter is enabled, this kernel boot option can be used to disable the use of MSI interrupts system-wide. noioapicquirk [APIC] Disable all boot interrupt quirks. Safety option to keep boot IRQs enabled. This should never be necessary. ioapicreroute [APIC] Enable rerouting of boot IRQs to the primary IO-APIC for bridges that cannot disable boot IRQs. This fixes a source of spurious IRQs when the system masks IRQs. 
noioapicreroute [APIC] Disable workaround that uses the boot IRQ equivalent of an IRQ that connects to a chipset where boot IRQs cannot be disabled. The opposite of ioapicreroute. biosirq [X86-32] Use PCI BIOS calls to get the interrupt routing table. These calls are known to be buggy on several machines and they hang the machine when used, but on other computers it's the only way to get the interrupt routing table. Try this option if the kernel is unable to allocate IRQs or discover secondary PCI buses on your motherboard. rom [X86] Assign address space to expansion ROMs. Use with caution as certain devices share address decoders between ROMs and other resources. norom [X86] Do not assign address space to expansion ROMs that do not already have BIOS assigned address ranges. nobar [X86] Do not assign address space to the BARs that weren't assigned by the BIOS. irqmask=0xMMMM [X86] Set a bit mask of IRQs allowed to be assigned automatically to PCI devices. You can make the kernel exclude IRQs of your ISA cards this way. pirqaddr=0xAAAAA [X86] Specify the physical address of the PIRQ table (normally generated by the BIOS) if it is outside the F0000h-100000h range. lastbus=N [X86] Scan all buses thru bus #N. Can be useful if the kernel is unable to find your secondary buses and you want to tell it explicitly which ones they are. assign-busses [X86] Always assign all PCI bus numbers ourselves, overriding whatever the firmware may have done. usepirqmask [X86] Honor the possible IRQ mask stored in the BIOS $PIR table. This is needed on some systems with broken BIOSes, notably some HP Pavilion N5400 and Omnibook XE3 notebooks. This will have no effect if ACPI IRQ routing is enabled. noacpi [X86] Do not use ACPI for IRQ routing or for PCI scanning. use_crs [X86] Use PCI host bridge window information from ACPI. On BIOSes from 2008 or later, this is enabled by default. If you need to use this, please report a bug. nocrs [X86] Ignore PCI host bridge windows from ACPI. 
If you need to use this, please report a bug. routeirq Do IRQ routing for all PCI devices. This is normally done in pci_enable_device(), so this option is a temporary workaround for broken drivers that don't call it. skip_isa_align [X86] Do not align the I/O start address, so that more PCI cards can be handled. firmware [ARM] Do not re-enumerate the bus but instead just use the configuration from the bootloader. This is currently used on IXP2000 systems where the bus has to be configured a certain way for adjunct CPUs. noearly [X86] Don't do any early type 1 scanning. This might help on some broken boards which machine check when some devices' config space is read. But various workarounds are disabled and some IOMMU drivers will not work. bfsort Sort PCI devices into breadth-first order. This sorting is done to get a device order compatible with older (<= 2.4) kernels. nobfsort Don't sort PCI devices into breadth-first order. cbiosize=nn[KMG] The fixed amount of bus space which is reserved for the CardBus bridge's IO window. The default value is 256 bytes. cbmemsize=nn[KMG] The fixed amount of bus space which is reserved for the CardBus bridge's memory window. The default value is 64 megabytes. resource_alignment= Format: [@][:]:.[; ...] Specifies alignment and device to reassign aligned memory resources. If is not specified, PAGE_SIZE is used as alignment. PCI-PCI bridge can be specified, if resource windows need to be expanded. ecrc= Enable/disable PCIe ECRC (transaction layer end-to-end CRC checking). bios: Use BIOS/firmware settings. This is the default. off: Turn ECRC off. on: Turn ECRC on. realloc Reallocate PCI resources if allocations done by BIOS are erroneous. pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power Management. off Disable ASPM. force Enable ASPM even on devices that claim not to support it. WARNING: Forcing ASPM on may cause system lockups.
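The pci=option[,option...] syntax described above mixes bare options (nomsi, noaer) with options that carry a value (lastbus=N, cbiosize=nn[KMG]). A minimal sketch of separating the two is shown here; parse_pci_options() is a hypothetical name, and the sketch assumes that no option value itself contains a comma.

```python
def parse_pci_options(value):
    """Split the value of pci= into (bare_options, options_with_values)."""
    flags = []
    settings = {}
    for option in value.split(","):
        if not option:
            continue  # tolerate empty items such as a trailing comma
        name, sep, arg = option.partition("=")
        if sep:
            settings[name] = arg
        else:
            flags.append(name)
    return flags, settings


if __name__ == "__main__":
    flags, settings = parse_pci_options("nomsi,noaer,lastbus=0xff,cbiosize=512")
    print("flags:", flags)
    print("settings:", settings)
```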
pcie_ports= [PCIE] PCIe ports handling: auto Ask the BIOS whether or not to use native PCIe services associated with PCIe ports (PME, hot-plug, AER). Use them only if that is allowed by the BIOS. native Use native PCIe services associated with PCIe ports unconditionally. compat Treat PCIe ports as PCI-to-PCI bridges, disable the PCIe ports driver. pcie_pme= [PCIE,PM] Native PCIe PME signaling options: nomsi Do not use MSI for native PCIe PME signaling (this makes all PCIe root ports use INTx for all services). pcmv= [HW,PCMCIA] BadgePAD 4 pd. [PARIDE] See Documentation/blockdev/paride.txt. pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at boot time. Format: { 0 | 1 } See arch/parisc/kernel/pdc_chassis.c percpu_alloc= Select which percpu first chunk allocator to use. Currently supported values are "embed" and "page". Archs may support subset or none of the selections. See comments in mm/percpu.c for details on each allocator. This parameter is primarily for debugging and performance comparison. pf. [PARIDE] See Documentation/blockdev/paride.txt. pg. [PARIDE] See Documentation/blockdev/paride.txt. pirq= [SMP,APIC] Manual mp-table setup See Documentation/x86/i386/IO-APIC.txt. plip= [PPT,NET] Parallel port network link Format: { parport | timid | 0 } See also Documentation/parport.txt. pmtmr= [X86] Manual setup of pmtmr I/O Port. Override pmtimer IOPort with a hex value. e.g. pmtmr=0x508 pnp.debug=1 [PNP] Enable PNP debug messages (depends on the CONFIG_PNP_DEBUG_MESSAGES option). Change at run-time via /sys/module/pnp/parameters/debug. We always show current resource usage; turning this on also shows possible settings and some assignment information. 
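Parameters written as module.parameter, such as pnp.debug above, may also be visible at run time under /sys/module, as the pnp.debug entry notes. The sketch below derives that path from the boot parameter name. sysfs_param_path() is a hypothetical helper, and it only builds the path: whether a given parameter is actually exposed there depends on the driver.

```python
from pathlib import PurePosixPath


def sysfs_param_path(boot_param):
    """Map 'module.param' to /sys/module/<module>/parameters/<param>."""
    module, sep, name = boot_param.partition(".")
    if not (sep and module and name):
        raise ValueError(f"expected module.parameter, got {boot_param!r}")
    return PurePosixPath("/sys/module") / module / "parameters" / name


if __name__ == "__main__":
    print(sysfs_param_path("pnp.debug"))
```

For pnp.debug this yields the path given in the entry above, /sys/module/pnp/parameters/debug.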
pnpacpi= [ACPI] { off } pnpbios= [ISAPNP] { on | off | curr | res | no-curr | no-res } pnp_reserve_irq= [ISAPNP] Exclude IRQs for the autoconfiguration pnp_reserve_dma= [ISAPNP] Exclude DMAs for the autoconfiguration pnp_reserve_io= [ISAPNP] Exclude I/O ports for the autoconfiguration Ranges are in pairs (I/O port base and size). pnp_reserve_mem= [ISAPNP] Exclude memory regions for the autoconfiguration. Ranges are in pairs (memory base and size). ports= [IP_VS_FTP] IPVS ftp helper module Default is 21. Up to 8 (IP_VS_APP_MAX_PORTS) ports may be specified. Format: ,.... print-fatal-signals= [KNL] debug: print fatal signals If enabled, warn about various signal handling related application anomalies: too many signals, too many POSIX.1 timers, fatal signals causing a coredump - etc. If you hit the warning due to signal overflow, you might want to try "ulimit -i unlimited". default: off. printk.always_kmsg_dump= Trigger kmsg_dump for cases other than kernel oops or panics Format: (1/Y/y=enable, 0/N/n=disable) default: disabled printk.time= Show timing data prefixed to each printk message line Format: (1/Y/y=enable, 0/N/n=disable) processor.max_cstate= [HW,ACPI] Limit processor to maximum C-state max_cstate=9 overrides any DMI blacklist limit. processor.nocst [HW,ACPI] Ignore the _CST method to determine C-states, instead using the legacy FADT method profile= [KNL] Enable kernel profiling via /proc/profile Format: [schedule,] Param: "schedule" - profile schedule points. Param: - step/bucket size as a power of 2 for statistical time based profiling. Param: "sleep" - profile D-state sleeping (millisecs). Requires CONFIG_SCHEDSTATS Param: "kvm" - profile VM exits. prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk before loading. See Documentation/blockdev/ramdisk.txt. psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to probe for; one of (bare|imps|exps|lifebook|any). 
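The (1/Y/y=enable, 0/N/n=disable) format listed above for printk.always_kmsg_dump= and printk.time= can be expressed as a small lookup. This is an illustrative sketch only; parse_bool_param() is a hypothetical name, and it accepts exactly the six spellings given in those entries.

```python
def parse_bool_param(value):
    """Interpret 1/Y/y as enabled and 0/N/n as disabled."""
    if value in ("1", "Y", "y"):
        return True
    if value in ("0", "N", "n"):
        return False
    raise ValueError(f"expected one of 1/Y/y/0/N/n, got {value!r}")


if __name__ == "__main__":
    for text in ("1", "y", "N"):
        print(text, "->", parse_bool_param(text))
```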
psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports per second. psmouse.resetafter= [HW,MOUSE] Try to reset the device after so many bad packets (0 = never). psmouse.resolution= [HW,MOUSE] Set desired mouse resolution, in dpi. psmouse.smartscroll= [HW,MOUSE] Controls Logitech smartscroll autorepeat. 0 = disabled, 1 = enabled (default). pstore.backend= Specify the name of the pstore backend to use pt. [PARIDE] See Documentation/blockdev/paride.txt. pty.legacy_count= [KNL] Number of legacy pty's. Overwrites compiled-in default number. quiet [KNL] Disable most log messages r128= [HW,DRM] raid= [HW,RAID] See Documentation/md.txt. ramdisk_blocksize= [RAM] See Documentation/blockdev/ramdisk.txt. ramdisk_size= [RAM] Sizes of RAM disks in kilobytes See Documentation/blockdev/ramdisk.txt. rcupdate.blimit= [KNL,BOOT] Set maximum number of finished RCU callbacks to process in one batch. rcupdate.qhimark= [KNL,BOOT] Set threshold of queued RCU callbacks over which batch limiting is disabled. rcupdate.qlowmark= [KNL,BOOT] Set threshold of queued RCU callbacks below which batch limiting is re-enabled. rdinit= [KNL] Format: Run specified binary instead of /init from the ramdisk, used for early userspace startup. See initrd. reboot= [BUGS=X86-32,BUGS=ARM,BUGS=IA-64] Rebooting mode Format: [,[,...]] See arch/*/kernel/reboot.c or arch/*/kernel/process.c relax_domain_level= [KNL, SMP] Set scheduler's default relax_domain_level. See Documentation/cgroups/cpusets.txt. reserve= [KNL,BUGS] Force the kernel to ignore some iomem area reservetop= [X86-32] Format: nn[KMG] Reserves a hole at the top of the kernel virtual address space. reservelow= [X86] Format: nn[K] Set the amount of memory to reserve for BIOS at the bottom of the address space. reset_devices [KNL] Force drivers to reset the underlying device during initialization. 
resume= [SWSUSP] Specify the partition device for software suspend resume_offset= [SWSUSP] Specify the offset from the beginning of the partition given by "resume=" at which the swap header is located, in units (needed only for swap files). See Documentation/power/swsusp-and-swap-files.txt resumedelay= [HIBERNATION] Delay (in seconds) to pause before attempting to read the resume files resumewait [HIBERNATION] Wait (indefinitely) for resume device to show up. Useful for devices that are detected asynchronously (e.g. USB and MMC devices). hibernate= [HIBERNATION] noresume Don't check if there's a hibernation image present during boot. nocompress Don't compress/decompress hibernation images. retain_initrd [RAM] Keep initrd memory after extraction rhash_entries= [KNL,NET] Set number of hash buckets for route cache riscom8= [HW,SERIAL] Format: [,[,...]] ro [KNL] Mount root device read-only on boot root= [KNL] Root filesystem See name_to_dev_t comment in init/do_mounts.c. rootdelay= [KNL] Delay (in seconds) to pause before attempting to mount the root filesystem rootflags= [KNL] Set root filesystem mount option string rootfstype= [KNL] Set root filesystem type rootwait [KNL] Wait (indefinitely) for root device to show up. Useful for devices that are detected asynchronously (e.g. USB and MMC devices). rw [KNL] Mount root device read-write on boot S [KNL] Run init in single mode sa1100ir [NET] See drivers/net/irda/sa1100_ir.c. sbni= [NET] Granch SBNI12 leased line adapter sched_debug [KNL] Enables verbose scheduler debug messages. security= [SECURITY] Choose a security module to enable at boot. If this boot parameter is not specified, only the first security module asking for security registration will be loaded. An invalid security module name will be treated as if no module has been chosen. selinux= [SELINUX] Disable or enable SELinux at boot time. Format: { "0" | "1" } See security/selinux/Kconfig help text. 0 -- disable. 1 -- enable. 
Default value is set via kernel config option. If enabled at boot time, /selinux/disable can be used later to disable prior to initial policy load. apparmor= [APPARMOR] Disable or enable AppArmor at boot time Format: { "0" | "1" } See security/apparmor/Kconfig help text 0 -- disable. 1 -- enable. Default value is set via kernel config option. serialnumber [BUGS=X86-32] shapers= [NET] Maximal number of shapers. show_msr= [x86] show boot-time MSR settings Format: { } Show boot-time (BIOS-initialized) MSR settings. The parameter means the number of CPUs to show, for example 1 means boot CPU only. simeth= [IA-64] simscsi= slram= [HW,MTD] slab_max_order= [MM, SLAB] Determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory fragmentation. Defaults to 1 for systems with more than 32MB of RAM, 0 otherwise. slub_debug[=options[,slabs]] [MM, SLUB] Enabling slub_debug allows one to determine the culprit if slab objects become corrupted. Enabling slub_debug can create guard zones around objects and may poison objects when not in use. Also tracks the last alloc / free. For more information see Documentation/vm/slub.txt. slub_max_order= [MM, SLUB] Determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory fragmentation. For more information see Documentation/vm/slub.txt. slub_min_objects= [MM, SLUB] The minimum number of objects per slab. SLUB will increase the slab order up to slub_max_order to generate a sufficiently large slab able to contain the number of objects indicated. The higher the number of objects the smaller the overhead of tracking slabs and the less frequently locks need to be acquired. For more information see Documentation/vm/slub.txt. slub_min_order= [MM, SLUB] Determines the minimum page order for slabs. Must be lower than slub_max_order. For more information see Documentation/vm/slub.txt. slub_nomerge [MM, SLUB] Disable merging of slabs with similar size.
May be necessary if there is some reason to distinguish allocs to different slabs. Debug options disable merging on their own. For more information see Documentation/vm/slub.txt. smart2= [HW] Format: [,[,...,]] smp-alt-once [X86-32,SMP] On a hotplug CPU system, only attempt to substitute SMP alternatives once at boot. smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port smsc-ircc2.ircc_sir= [HW] SIR base I/O port smsc-ircc2.ircc_fir= [HW] FIR base I/O port smsc-ircc2.ircc_irq= [HW] IRQ line smsc-ircc2.ircc_dma= [HW] DMA channel smsc-ircc2.ircc_transceiver= [HW] Transceiver type: 0: Toshiba Satellite 1800 (GP data pin select) 1: Fast pin select (default) 2: ATC IRMode softlockup_panic= [KNL] Should the soft-lockup detector generate panics. Format: sonypi.*= [HW] Sony Programmable I/O Control Device driver See Documentation/laptops/sonypi.txt specialix= [HW,SERIAL] Specialix multi-serial port adapter See Documentation/serial/specialix.txt. spia_io_base= [HW,MTD] spia_fio_base= spia_pedr= spia_peddr= stacktrace [FTRACE] Enable the stack tracer on boot up. stacktrace_filter=[function-list] [FTRACE] Limit the functions that the stack tracer will trace at boot up. function-list is a comma separated list of functions. This list can be changed at run time by the stack_trace_filter file in the debugfs tracing directory. Note that this enables stack tracing, so the stacktrace parameter above is not needed. sti= [PARISC,HW] Format: Set the STI (builtin display/keyboard on the HP-PARISC machines) console (graphic card) which should be used as the initial boot-console. See also comment in drivers/video/console/sticore.c. sti_font= [HW] See comment in drivers/video/console/sticore.c. stifb= [HW] Format: bpp:[:[:...]] sunrpc.min_resvport= sunrpc.max_resvport= [NFS,SUNRPC] SunRPC servers often require that client requests originate from a privileged port (i.e. a port in the range 0 < portnr < 1024).
			An administrator who wishes to reserve some of these
			ports for other uses may adjust the range that the
			kernel's sunrpc client considers to be privileged
			using these two parameters to set the minimum and
			maximum port values.

	sunrpc.pool_mode=
			[NFS]
			Control how the NFS server code allocates CPUs to
			service thread pools.  Depending on how many NICs
			you have and where their interrupts are bound, this
			option will affect which CPUs will do NFS serving.
			Note: this parameter cannot be changed while the
			NFS server is running.

			auto	    the server chooses an appropriate mode
				    automatically using heuristics
			global	    a single global pool contains all CPUs
			percpu	    one pool for each CPU
			pernode	    one pool for each NUMA node (equivalent
				    to global on non-NUMA machines)

	sunrpc.tcp_slot_table_entries=
	sunrpc.udp_slot_table_entries=
			[NFS,SUNRPC]
			Sets the upper limit on the number of simultaneous
			RPC calls that can be sent from the client to a
			server. Increasing these values may allow you to
			improve throughput, but will also increase the
			amount of memory reserved for use by the client.

	swapaccount[=0|1]
			[KNL] Enable accounting of swap in memory resource
			controller if no parameter or 1 is given or disable
			it if 0 is given (See Documentation/cgroups/memory.txt)

	swiotlb=	[IA-64] Number of I/O TLB slabs

	switches=	[HW,M68k]

	sysfs.deprecated=0|1 [KNL]
			Enable/disable old style sysfs layout for old udev
			on older distributions. When this option is enabled,
			very new udev will not work anymore. When this option
			is disabled (or CONFIG_SYSFS_DEPRECATED is not compiled
			in), older udev will not work anymore.
			Default depends on CONFIG_SYSFS_DEPRECATED_V2 set in
			the kernel configuration.

	sysrq_always_enabled
			[KNL]
			Ignore sysrq setting - this boot parameter will
			neutralize any effect of /proc/sys/kernel/sysrq.
			Useful for debugging.

	tdfx=		[HW,DRM]

	test_suspend=	[SUSPEND]
			Specify "mem" (for Suspend-to-RAM) or "standby" (for
			standby suspend) as the system sleep state to briefly
			enter during system startup.
			The system is woken from this state using a
			wakeup-capable RTC alarm.

	thash_entries=	[KNL,NET]
			Set number of hash buckets for TCP connection

	thermal.act=	[HW,ACPI]
			-1: disable all active trip points in all thermal zones
			: override all lowest active trip points

	thermal.crt=	[HW,ACPI]
			-1: disable all critical trip points in all thermal zones
			: override all critical trip points

	thermal.nocrt=	[HW,ACPI]
			Set to disable actions on ACPI thermal zone
			critical and hot trip points.

	thermal.off=	[HW,ACPI]
			1: disable ACPI thermal control

	thermal.psv=	[HW,ACPI]
			-1: disable all passive trip points
			: override all passive trip points to this value

	thermal.tzp=	[HW,ACPI]
			Specify global default ACPI thermal zone polling rate
			: poll all this frequency
			0: no polling (default)

	threadirqs	[KNL]
			Force threading of all interrupt handlers except those
			marked explicitly IRQF_NO_THREAD.

	topology=	[S390]
			Format: {off | on}
			Specify if the kernel should make use of the cpu
			topology information if the hardware supports this.
			The scheduler will make use of this information and
			e.g. base its process migration decisions on it.
			Default is on.

	tp720=		[HW,PS2]

	tpm_suspend_pcr=[HW,TPM]
			Format: integer pcr id
			Specify that at suspend time, the tpm driver
			should extend the specified pcr with zeros,
			as a workaround for some chips which fail to
			flush the last written pcr on TPM_SaveState.
			This will guarantee that all the other pcrs
			are saved.

	trace_buf_size=nn[KMG]
			[FTRACE] will set tracing buffer size.

	trace_event=[event-list]
			[FTRACE] Set and start specified trace events in order
			to facilitate early boot debugging.
			See also Documentation/trace/events.txt

	tsc=		Disable clocksource stability checks for TSC.
			Format:
			[x86] reliable: mark tsc clocksource as reliable, this
			disables clocksource verification at runtime, as well
			as the stability checks done at bootup.	Used to enable
			high-resolution timer mode on older hardware, and in
			virtualized environment.
			[x86] noirqtime: Do not use TSC to do irq accounting.
			Used to disable IRQ_TIME_ACCOUNTING at run time on any
			platforms where RDTSC is slow and this accounting can
			add overhead.

	turbografx.map[2|3]=	[HW,JOY]
			TurboGraFX parallel port interface
			Format:
			,,,,,,,
			See also Documentation/input/joystick-parport.txt

	udbg-immortal	[PPC] When debugging early kernel crashes that
			happen after console_init() and before a proper
			console driver takes over, this boot option might
			help in "seeing" what's going on.

	uhash_entries=	[KNL,NET]
			Set number of hash buckets for UDP/UDP-Lite connections

	uhci-hcd.ignore_oc=
			[USB] Ignore overcurrent events (default N).
			Some badly-designed motherboards generate lots of
			bogus events, for ports that aren't wired to
			anything.  Set this parameter to avoid log spamming.
			Note that genuine overcurrent events won't be
			reported either.

	unknown_nmi_panic
			[X86] Cause panic on unknown NMI.

	usbcore.authorized_default=
			[USB] Default USB device authorization:
			(default -1 = authorized except for wireless USB,
			0 = not authorized, 1 = authorized)

	usbcore.autosuspend=
			[USB] The autosuspend time delay (in seconds) used
			for newly-detected USB devices (default 2).  This
			is the time required before an idle device will be
			autosuspended.  Devices for which the delay is set
			to a negative value won't be autosuspended at all.

	usbcore.usbfs_snoop=
			[USB] Set to log all usbfs traffic (default 0 = off).

	usbcore.blinkenlights=
			[USB] Set to cycle leds on hubs (default 0 = off).

	usbcore.old_scheme_first=
			[USB] Start with the old device initialization
			scheme (default 0 = off).

	usbcore.usbfs_memory_mb=
			[USB] Memory limit (in MB) for buffers allocated by
			usbfs (default = 16, 0 = max = 2047).

	usbcore.use_both_schemes=
			[USB] Try the other device initialization scheme
			if the first one fails (default 1 = enabled).

	usbcore.initial_descriptor_timeout=
			[USB] Specifies timeout for the initial 64-byte
			USB_REQ_GET_DESCRIPTOR request in milliseconds
			(default 5000 = 5.0 seconds).

	usbhid.mousepoll=
			[USBHID] The interval which mice are to be polled at.
usb-storage.delay_use= [UMS] The delay in seconds before a new device is scanned for Logical Units (default 5). usb-storage.quirks= [UMS] A list of quirks entries to supplement or override the built-in unusual_devs list. List entries are separated by commas. Each entry has the form VID:PID:Flags where VID and PID are Vendor and Product ID values (4-digit hex numbers) and Flags is a set of characters, each corresponding to a common usb-storage quirk flag as follows: a = SANE_SENSE (collect more than 18 bytes of sense data); b = BAD_SENSE (don't collect more than 18 bytes of sense data); c = FIX_CAPACITY (decrease the reported device capacity by one sector); d = NO_READ_DISC_INFO (don't use READ_DISC_INFO command); e = NO_READ_CAPACITY_16 (don't use READ_CAPACITY_16 command); h = CAPACITY_HEURISTICS (decrease the reported device capacity by one sector if the number is odd); i = IGNORE_DEVICE (don't bind to this device); l = NOT_LOCKABLE (don't try to lock and unlock ejectable media); m = MAX_SECTORS_64 (don't transfer more than 64 sectors = 32 KB at a time); n = INITIAL_READ10 (force a retry of the initial READ(10) command); o = CAPACITY_OK (accept the capacity reported by the device); r = IGNORE_RESIDUE (the device reports bogus residue values); s = SINGLE_LUN (the device has only one Logical Unit); w = NO_WP_DETECT (don't test whether the medium is write-protected). Example: quirks=0419:aaf5:rl,0421:0433:rc user_debug= [KNL,ARM] Format: See arch/arm/Kconfig.debug help text. 1 - undefined instruction events 2 - system calls 4 - invalid data aborts 8 - SIGSEGV faults 16 - SIGBUS faults Example: user_debug=31 userpte= [X86] Flags controlling user PTE allocations. nohigh = do not allocate PTE pages in HIGHMEM regardless of setting of CONFIG_HIGHPTE. 
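The user_debug= value listed above is a bitmask, so the example value 31 is simply 1 + 2 + 4 + 8 + 16, i.e. all five classes of events. The following is a minimal userspace sketch of that arithmetic, not kernel code. The bit values are taken from the user_debug= entry; the function and table names are hypothetical and exist only for illustration.

```python
# Bit values as listed in the user_debug= entry. The names USER_DEBUG_BITS,
# decode_user_debug and encode_user_debug are made up for this sketch.
USER_DEBUG_BITS = {
    1: "undefined instruction events",
    2: "system calls",
    4: "invalid data aborts",
    8: "SIGSEGV faults",
    16: "SIGBUS faults",
}


def decode_user_debug(value):
    """Return the descriptions of all bits that are set in value."""
    return [label for bit, label in USER_DEBUG_BITS.items() if value & bit]


def encode_user_debug(bits):
    """OR together the requested bits, rejecting values not in the table."""
    value = 0
    for bit in bits:
        if bit not in USER_DEBUG_BITS:
            raise ValueError("unknown user_debug bit: %r" % (bit,))
        value |= bit
    return value
```

For instance, encode_user_debug([8, 16]) gives the value to pass when only SIGSEGV and SIGBUS faults are of interest, and decode_user_debug(31) lists all five classes.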
vdso= [X86,SH] vdso=2: enable compat VDSO (default with COMPAT_VDSO) vdso=1: enable VDSO (default) vdso=0: disable VDSO mapping vdso32= [X86] vdso32=2: enable compat VDSO (default with COMPAT_VDSO) vdso32=1: enable 32-bit VDSO (default) vdso32=0: disable 32-bit VDSO mapping vector= [IA-64,SMP] vector=percpu: enable percpu vector domain video= [FB] Frame buffer configuration See Documentation/fb/modedb.txt. vga= [BOOT,X86-32] Select a particular video mode See Documentation/x86/boot.txt and Documentation/svga.txt. Use vga=ask for menu. This is actually a boot loader parameter; the value is passed to the kernel using a special protocol. vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact size of . This can be used to increase the minimum size (128MB on x86). It can also be used to decrease the size and leave more room for directly mapped kernel RAM. vmhalt= [KNL,S390] Perform z/VM CP command after system halt. Format: vmpanic= [KNL,S390] Perform z/VM CP command after kernel panic. Format: vmpoff= [KNL,S390] Perform z/VM CP command after power off. Format: vsyscall= [X86-64] Controls the behavior of vsyscalls (i.e. calls to fixed addresses of 0xffffffffff600x00 from legacy code). Most statically-linked binaries and older versions of glibc use these calls. Because these functions are at fixed addresses, they make nice targets for exploits that can control RIP. emulate [default] Vsyscalls turn into traps and are emulated reasonably safely. native Vsyscalls are native syscall instructions. This is a little bit faster than trapping and makes a few dynamic recompilers work better than they would in emulation mode. It also makes exploits much easier to write. none Vsyscalls don't work at all. This makes them quite hard to use for exploits but might break your system. vt.cur_default= [VT] Default cursor shape. Format: 0xCCBBAA, where AA, BB, and CC are the same as the parameters of the [?A;B;Cc escape sequence; see VGA-softcursor.txt. Default: 2 = underline. 
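The 0xCCBBAA format of vt.cur_default= packs three one-byte parameters into a single number, with AA in the lowest byte. The sketch below shows one way to split and rebuild such a value in userspace. It assumes nothing beyond the byte layout stated in the entry; the function names are hypothetical.

```python
def split_cur_default(value):
    """Split a vt.cur_default value of the form 0xCCBBAA into (AA, BB, CC)."""
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError("value does not fit the 0xCCBBAA format")
    aa = value & 0xFF           # lowest byte
    bb = (value >> 8) & 0xFF    # middle byte
    cc = (value >> 16) & 0xFF   # highest byte
    return aa, bb, cc


def join_cur_default(aa, bb, cc):
    """Pack AA, BB and CC back into a single 0xCCBBAA value."""
    for part in (aa, bb, cc):
        if not 0 <= part <= 0xFF:
            raise ValueError("each parameter must fit in one byte")
    return (cc << 16) | (bb << 8) | aa
```

With this layout the documented default of 2 splits into AA = 2, BB = 0 and CC = 0.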
vt.default_blu= [VT] Format: ,,,..., Change the default blue palette of the console. This is a 16-member array composed of values ranging from 0-255. vt.default_grn= [VT] Format: ,,,..., Change the default green palette of the console. This is a 16-member array composed of values ranging from 0-255. vt.default_red= [VT] Format: ,,,..., Change the default red palette of the console. This is a 16-member array composed of values ranging from 0-255. vt.default_utf8= [VT] Format=<0|1> Set system-wide default UTF-8 mode for all tty's. Default is 1, i.e. UTF-8 mode is enabled for all newly opened terminals. vt.global_cursor_default= [VT] Format=<-1|0|1> Set system-wide default for whether a cursor is shown on new VTs. Default is -1, i.e. cursors will be created by default unless overridden by individual drivers. 0 will hide cursors, 1 will display them. watchdog timers [HW,WDT] For information on watchdog timers, see Documentation/watchdog/watchdog-parameters.txt or other driver-specific files in the Documentation/watchdog/ directory. x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of default x2apic cluster mode on platforms supporting x2apic. x86_mrst_timer= [X86-32,APBT] Choose timer option for x86 Moorestown MID platform. Two valid options are apbt timer only and lapic timer plus one apbt timer for broadcast timer. x86_mrst_timer=apbt_only | lapic_and_apbt xd= [HW,XT] Original XT pre-IDE (RLL encoded) disks. xd_geo= See header of drivers/block/xd.c. 
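The vt.default_red=, vt.default_grn= and vt.default_blu= parameters above each take one colour channel of the 16-entry console palette. Palettes are usually written down as 16 (red, green, blue) triples, so the channels have to be separated first. A minimal sketch of that step follows, assuming the comma-separated 16-member format described in the entries; the helper name is hypothetical.

```python
def palette_params(colors):
    """Build the three vt.default_* settings from 16 (red, green, blue) tuples."""
    if len(colors) != 16:
        raise ValueError("the console palette has exactly 16 entries")
    for rgb in colors:
        if len(rgb) != 3 or any(not 0 <= c <= 255 for c in rgb):
            raise ValueError("each colour needs three components in 0-255")
    names = ("vt.default_red", "vt.default_grn", "vt.default_blu")
    # One parameter per channel, each listing that channel for all 16 colours.
    return " ".join(
        name + "=" + ",".join(str(rgb[i]) for rgb in colors)
        for i, name in enumerate(names)
    )
```

The returned string is meant to be appended to the kernel command line as-is.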
xen_emul_unplug= [HW,X86,XEN] Unplug Xen emulated devices Format: [unplug0,][unplug1] ide-disks -- unplug primary master IDE devices aux-ide-disks -- unplug non-primary-master IDE devices nics -- unplug network devices all -- unplug all emulated devices (NICs and IDE disks) unnecessary -- unplugging emulated devices is unnecessary even if the host did not respond to the unplug protocol never -- do not unplug even if version check succeeds xirc2ps_cs= [NET,PCMCIA] Format: ,,,,,[,[,[,]]] ______________________________________________________________________ TODO: Add more DRM drivers. GETTING STARTED WITH KMEMCHECK ============================== Vegard Nossum Contents ======== 0. Introduction 1. Downloading 2. Configuring and compiling 3. How to use 3.1. Booting 3.2. Run-time enable/disable 3.3. Debugging 3.4. Annotating false positives 4. Reporting errors 5. Technical description 0. Introduction =============== kmemcheck is a debugging feature for the Linux Kernel. More specifically, it is a dynamic checker that detects and warns about some uses of uninitialized memory. Userspace programmers might be familiar with Valgrind's memcheck. The main difference between memcheck and kmemcheck is that memcheck works for userspace programs only, and kmemcheck works for the kernel only. The implementations are of course vastly different. Because of this, kmemcheck is not as accurate as memcheck, but it turns out to be good enough in practice to discover real programmer errors that the compiler is not able to find through static analysis. Enabling kmemcheck on a kernel will probably slow it down to the extent that the machine will not be usable for normal workloads such as e.g. an interactive desktop. kmemcheck will also cause the kernel to use about twice as much memory as normal. For this reason, kmemcheck is strictly a debugging feature. 1. Downloading ============== As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel. 2. 
Configuring and compiling ============================ kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of configuration variables must have specific settings in order for the kmemcheck menu to even appear in "menuconfig". These are: o CONFIG_CC_OPTIMIZE_FOR_SIZE=n This option is located under "General setup" / "Optimize for size". Without this, gcc will use certain optimizations that usually lead to false positive warnings from kmemcheck. An example of this is a 16-bit field in a struct, where gcc may load 32 bits, then discard the upper 16 bits. kmemcheck sees only the 32-bit load, and may trigger a warning for the upper 16 bits (if they're uninitialized). o CONFIG_SLAB=y or CONFIG_SLUB=y This option is located under "General setup" / "Choose SLAB allocator". o CONFIG_FUNCTION_TRACER=n This option is located under "Kernel hacking" / "Tracers" / "Kernel Function Tracer" When function tracing is compiled in, gcc emits a call to another function at the beginning of every function. This means that when the page fault handler is called, the ftrace framework will be called before kmemcheck has had a chance to handle the fault. If ftrace then modifies memory that was tracked by kmemcheck, the result is an endless recursive page fault. o CONFIG_DEBUG_PAGEALLOC=n This option is located under "Kernel hacking" / "Debug page memory allocations". In addition, I highly recommend turning on CONFIG_DEBUG_INFO=y. This is also located under "Kernel hacking". With this, you will be able to get line number information from the kmemcheck warnings, which is extremely valuable in debugging a problem. This option is not mandatory, however, because it slows down the compilation process and produces a much bigger kernel image. Now the kmemcheck menu should be visible (under "Kernel hacking" / "kmemcheck: trap use of uninitialized memory"). 
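If the menu does not show up, the prerequisites listed above can be checked against an existing .config file. The following is a small sketch of such a check, not a tool shipped with the kernel. It relies only on the usual .config line forms ("CONFIG_FOO=y" and "# CONFIG_FOO is not set"); the function names are hypothetical, and the x86 requirement is not checked.

```python
import re


def parse_config(text):
    """Parse .config text into a dict; options that are not set map to "n"."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if match:
            opts[match.group(1)] = match.group(2)
            continue
        match = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if match:
            opts[match.group(1)] = "n"
    return opts


def kmemcheck_prereq_problems(opts):
    """Return the prerequisites from the list above that are not met."""
    problems = []
    for name in ("CONFIG_CC_OPTIMIZE_FOR_SIZE",
                 "CONFIG_FUNCTION_TRACER",
                 "CONFIG_DEBUG_PAGEALLOC"):
        # An option missing from the file is treated as disabled.
        if opts.get(name, "n") != "n":
            problems.append(name + " must be n")
    if opts.get("CONFIG_SLAB", "n") != "y" and opts.get("CONFIG_SLUB", "n") != "y":
        problems.append("one of CONFIG_SLAB or CONFIG_SLUB must be y")
    return problems
```

A typical use would be kmemcheck_prereq_problems(parse_config(open(".config").read())), where an empty list means that all four prerequisites are satisfied.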
Here follows a description of the kmemcheck configuration variables:

  o CONFIG_KMEMCHECK

	This must be enabled in order to use kmemcheck at all...

  o CONFIG_KMEMCHECK_[DISABLED | ENABLED | ONESHOT]_BY_DEFAULT

	This option controls the status of kmemcheck at boot-time. "Enabled"
	will enable kmemcheck right from the start, "disabled" will boot the
	kernel as normal (but with the kmemcheck code compiled in, so it can
	be enabled at run-time after the kernel has booted), and "one-shot" is
	a special mode which will turn kmemcheck off automatically after
	detecting the first use of uninitialized memory.

	If you are using kmemcheck to actively debug a problem, then you
	probably want to choose "enabled" here.

	The one-shot mode is mostly useful in automated test setups because it
	can prevent floods of warnings and increase the chances of the machine
	surviving in case something is really wrong. In other cases, the
	one-shot mode could actually be counter-productive because it would
	turn itself off at the very first error -- in the case of a false
	positive too -- and this would come in the way of debugging the
	specific problem you were interested in.

	If you would like to use your kernel as normal, but with a chance to
	enable kmemcheck in case of some problem, it might be a good idea to
	choose "disabled" here. When kmemcheck is disabled, most of the
	run-time overhead is not incurred, and the kernel will be almost as
	fast as normal.

  o CONFIG_KMEMCHECK_QUEUE_SIZE

	Select the maximum number of error reports to store in an internal
	(fixed-size) buffer. Since errors can occur virtually anywhere and in
	any context, we need a temporary storage area which is guaranteed not
	to generate any other page faults when accessed. The queue will be
	emptied as soon as a tasklet may be scheduled. If the queue is full,
	new error reports will be lost.

	The default value of 64 is probably fine.
If some code produces more than 64 errors within an irqs-off section, then the code is likely to produce many, many more, too, and these additional reports seldom give any more information (the first report is usually the most valuable anyway). This number might have to be adjusted if you are not using serial console or similar to capture the kernel log. If you are using the "dmesg" command to save the log, then getting a lot of kmemcheck warnings might overflow the kernel log itself, and the earlier reports will get lost in that way instead. Try setting this to 10 or so on such a setup. o CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT Select the number of shadow bytes to save along with each entry of the error-report queue. These bytes indicate what parts of an allocation are initialized, uninitialized, etc. and will be displayed when an error is detected to help the debugging of a particular problem. The number entered here is actually the logarithm of the number of bytes that will be saved. So if you pick for example 5 here, kmemcheck will save 2^5 = 32 bytes. The default value should be fine for debugging most problems. It also fits nicely within 80 columns. o CONFIG_KMEMCHECK_PARTIAL_OK This option (when enabled) works around certain GCC optimizations that produce 32-bit reads from 16-bit variables where the upper 16 bits are thrown away afterwards. The default value (enabled) is recommended. This may of course hide some real errors, but disabling it would probably produce a lot of false positives. o CONFIG_KMEMCHECK_BITOPS_OK This option silences warnings that would be generated for bit-field accesses where not all the bits are initialized at the same time. This may also hide some real bugs. This option is probably obsolete, or it should be replaced with the kmemcheck-/bitfield-annotations for the code in question. The default value is therefore fine. Now compile the kernel as usual. 3. How to use ============= 3.1. 
Booting ============ First some information about the command-line options. There is only one option specific to kmemcheck, and this is called "kmemcheck". It can be used to override the default mode as chosen by the CONFIG_KMEMCHECK_*_BY_DEFAULT option. Its possible settings are: o kmemcheck=0 (disabled) o kmemcheck=1 (enabled) o kmemcheck=2 (one-shot mode) If SLUB debugging has been enabled in the kernel, it may take precedence over kmemcheck in such a way that the slab caches which are under SLUB debugging will not be tracked by kmemcheck. In order to ensure that this doesn't happen (even though it shouldn't by default), use SLUB's boot option "slub_debug", like this: slub_debug=- In fact, this option may also be used for fine-grained control over SLUB vs. kmemcheck. For example, if the command line includes "kmemcheck=1 slub_debug=,dentry", then SLUB debugging will be used only for the "dentry" slab cache, and with kmemcheck tracking all the other caches. This is advanced usage, however, and is not generally recommended. 3.2. Run-time enable/disable ============================ When the kernel has booted, it is possible to enable or disable kmemcheck at run-time. WARNING: This feature is still experimental and may cause false positive warnings to appear. Therefore, try not to use this. If you find that it doesn't work properly (e.g. you see an unreasonable amount of warnings), I will be happy to take bug reports. Use the file /proc/sys/kernel/kmemcheck for this purpose, e.g.: $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck The numbers are the same as for the kmemcheck= command-line option. 3.3. 
Debugging
==============

A typical report will look something like this:

WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024)
80000000000000000000000000000000000000000088ffff0000000000000000
 i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
         ^

Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A
RIP: 0010:[]  [] __dequeue_signal+0xc8/0x190
RSP: 0018:ffff88003cdf7d98  EFLAGS: 00210002
RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009
RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84
RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000
R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e
R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8
FS:  0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
 [] dequeue_signal+0x8e/0x170
 [] get_signal_to_deliver+0x98/0x390
 [] do_notify_resume+0xad/0x7d0
 [] int_signal+0x12/0x17
 [] 0xffffffffffffffff

The single most valuable piece of information in this report is the RIP (or
EIP on 32-bit) value. This will help us pinpoint exactly which instruction
caused the warning.

If your kernel was compiled with CONFIG_DEBUG_INFO=y, then all we have to do
is give this address to the addr2line program, like this:

	$ addr2line -e vmlinux -i ffffffff8104ede8
	arch/x86/include/asm/string_64.h:12
	include/asm-generic/siginfo.h:287
	kernel/signal.c:380
	kernel/signal.c:410

The "-e vmlinux" tells addr2line which file to look in. IMPORTANT: This must
be the vmlinux of the kernel that produced the warning in the first place! If
not, the line number information will almost certainly be wrong.

The "-i" tells addr2line to also print the line numbers of inlined functions.
In this case, the flag was very important, because otherwise, it would only have printed the first line, which is just a call to memcpy(), which could be called from a thousand places in the kernel, and is therefore not very useful. These inlined functions would not show up in the stack trace above, simply because the kernel doesn't load the extra debugging information. This technique can of course be used with ordinary kernel oopses as well. In this case, it's the caller of memcpy() that is interesting, and it can be found in include/asm-generic/siginfo.h, line 287: 281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) 282 { 283 if (from->si_code < 0) 284 memcpy(to, from, sizeof(*to)); 285 else 286 /* _sigchld is currently the largest know union member */ 287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); 288 } Since this was a read (kmemcheck usually warns about reads only, though it can warn about writes to unallocated or freed memory as well), it was probably the "from" argument which contained some uninitialized bytes. Following the chain of calls, we move upwards to see where "from" was allocated or initialized, kernel/signal.c, line 380: 359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) 360 { ... 367 list_for_each_entry(q, &list->list, list) { 368 if (q->info.si_signo == sig) { 369 if (first) 370 goto still_pending; 371 first = q; ... 377 if (first) { 378 still_pending: 379 list_del_init(&first->list); 380 copy_siginfo(info, &first->info); 381 __sigqueue_free(first); ... 392 } 393 } Here, it is &first->info that is being passed on to copy_siginfo(). The variable "first" was found on a list -- passed in as the second argument to collect_signal(). We continue our journey through the stack, to figure out where the item on "list" was allocated or initialized. 
We move to line 410: 395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, 396 siginfo_t *info) 397 { ... 410 collect_signal(sig, pending, info); ... 414 } Now we need to follow the "pending" pointer, since that is being passed on to collect_signal() as "list". At this point, we've run out of lines from the "addr2line" output. Not to worry, we just paste the next addresses from the kmemcheck stack dump, i.e.: [] dequeue_signal+0x8e/0x170 [] get_signal_to_deliver+0x98/0x390 [] do_notify_resume+0xad/0x7d0 [] int_signal+0x12/0x17 $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \ ffffffff8100b87d ffffffff8100c7b5 kernel/signal.c:446 kernel/signal.c:1806 arch/x86/kernel/signal.c:805 arch/x86/kernel/signal.c:871 arch/x86/kernel/entry_64.S:694 Remember that since these addresses were found on the stack and not as the RIP value, they actually point to the _next_ instruction (they are return addresses). This becomes obvious when we look at the code for line 446: 422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) 423 { ... 431 signr = __dequeue_signal(&tsk->signal->shared_pending, 432 mask, info); 433 /* 434 * itimer signal ? 435 * 436 * itimers are process shared and we restart periodic 437 * itimers in the signal delivery path to prevent DoS 438 * attacks in the high resolution timer case. This is 439 * compliant with the old way of self restarting 440 * itimers, as the SIGALRM is a legacy signal and only 441 * queued once. Changing the restart behaviour to 442 * restart the timer in the signal dequeue path is 443 * reducing the timer noise on heavy loaded !highres 444 * systems too. 445 */ 446 if (unlikely(signr == SIGALRM)) { ... 489 } So instead of looking at 446, we should be looking at 431, which is the line that executes just before 446. Here we see that what we are looking for is &tsk->signal->shared_pending. 
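The correction from line 446 back to line 431 was done by hand above. A common alternative, which is an assumption of this sketch and not part of the procedure described in this document, is to subtract 1 from each return address before giving it to addr2line. The adjusted address then lies inside the call instruction itself, so the lookup may resolve to the calling line directly. The function names below are hypothetical; only the addr2line options "-e" and "-i" already used above are relied on.

```python
def call_site_addresses(return_addresses):
    """Map hex return addresses to addresses inside the preceding call.

    A return address points at the instruction after the call, so
    subtracting one byte yields an address within the call instruction.
    """
    return [format(int(addr, 16) - 1, "x") for addr in return_addresses]


def addr2line_command(addresses, image="vmlinux"):
    """Build the addr2line argument list; nothing is executed here."""
    return ["addr2line", "-e", image, "-i"] + list(addresses)
```

Applied to the four stack addresses above, call_site_addresses() turns ffffffff8104f04e into ffffffff8104f04d, and so on. This adjustment must not be applied to the RIP value itself, which already points at the faulting instruction.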
Our next task is now to figure out which function puts items on this
"shared_pending" list. A crude but efficient tool is git grep:

	$ git grep -n 'shared_pending' kernel/
	...
	kernel/signal.c:828:	pending = group ? &t->signal->shared_pending : &t->pending;
	kernel/signal.c:1339:	pending = group ? &t->signal->shared_pending : &t->pending;
	...

There were more results, but none of them were related to list operations,
and these were the only assignments. We inspect the line numbers more closely
and find that this is indeed where items are being added to the list:

816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
817 			int group)
818 {
...
828 	pending = group ? &t->signal->shared_pending : &t->pending;
...
851 	q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN &&
852 					     (is_si_special(info) ||
853 					      info->si_code >= 0)));
854 	if (q) {
855 		list_add_tail(&q->list, &pending->list);
...
890 }

and:

1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
1310 {
....
1339 	pending = group ? &t->signal->shared_pending : &t->pending;
1340 	list_add_tail(&q->list, &pending->list);
....
1347 }

In the first case, the list element we are looking for, "q", is being returned
from the function __sigqueue_alloc(), which looks like an allocation function.
Let's take a look at it: 187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, 188 int override_rlimit) 189 { 190 struct sigqueue *q = NULL; 191 struct user_struct *user; 192 193 /* 194 * We won't get problems with the target's UID changing under us 195 * because changing it requires RCU be used, and if t != current, the 196 * caller must be holding the RCU readlock (by way of a spinlock) and 197 * we use RCU protection here 198 */ 199 user = get_uid(__task_cred(t)->user); 200 atomic_inc(&user->sigpending); 201 if (override_rlimit || 202 atomic_read(&user->sigpending) <= 203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) 204 q = kmem_cache_alloc(sigqueue_cachep, flags); 205 if (unlikely(q == NULL)) { 206 atomic_dec(&user->sigpending); 207 free_uid(user); 208 } else { 209 INIT_LIST_HEAD(&q->list); 210 q->flags = 0; 211 q->user = user; 212 } 213 214 return q; 215 } We see that this function initializes q->list, q->flags, and q->user. It seems that now is the time to look at the definition of "struct sigqueue", e.g.: 14 struct sigqueue { 15 struct list_head list; 16 int flags; 17 siginfo_t info; 18 struct user_struct *user; 19 }; And, you might remember, it was a memcpy() on &first->info that caused the warning, so this makes perfect sense. It also seems reasonable to assume that it is the caller of __sigqueue_alloc() that has the responsibility of filling out (initializing) this member. But just which fields of the struct were uninitialized? Let's look at kmemcheck's report again: WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) 80000000000000000000000000000000000000000088ffff0000000000000000 i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u ^ These first two lines are the memory dump of the memory object itself, and the shadow bytemap, respectively. The memory object itself is in this case &first->info. Just beware that the start of this dump is NOT the start of the object itself! 
The position of the caret (^) corresponds with the address of the read (ffff88003e4a2024). The shadow bytemap dump legend is as follows: i - initialized u - uninitialized a - unallocated (memory has been allocated by the slab layer, but has not yet been handed off to anybody) f - freed (memory has been allocated by the slab layer, but has been freed by the previous owner) In order to figure out where (relative to the start of the object) the uninitialized memory was located, we have to look at the disassembly. For that, we'll need the RIP address again: RIP: 0010:[] [] __dequeue_signal+0xc8/0x190 $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: ffffffff8104edc8: mov %r8,0x8(%r8) ffffffff8104edcc: test %r10d,%r10d ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168> ffffffff8104edd5: mov %rax,%rdx ffffffff8104edd8: mov $0xc,%ecx ffffffff8104eddd: mov %r13,%rdi ffffffff8104ede0: mov $0x30,%eax ffffffff8104ede5: mov %rdx,%rsi ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi) ffffffff8104edea: test $0x2,%al ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0> ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi) ffffffff8104edf0: test $0x1,%al ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5> ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi) ffffffff8104edf5: mov %r8,%rdi ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> As expected, it's the "rep movsl" instruction from the memcpy() that causes the warning. We know about REP MOVSL that it uses the register RCX to count the number of remaining iterations. By taking a look at the register dump again (from the kmemcheck report), we can figure out how many bytes were left to copy: RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 By looking at the disassembly, we also see that %ecx is being loaded with the value $0xc just before (ffffffff8104edd8), so we are very lucky. Keep in mind that this is the number of iterations, not bytes. 
And since this is a "long" operation, we need to multiply by 4 to get the
number of bytes. So this means that the uninitialized value was encountered
at 4 * (0xc - 0x9) = 12 bytes from the start of the object.

We can now try to figure out which field of the "struct siginfo" was not
initialized. This is the beginning of the struct:

40 typedef struct siginfo {
41 	int si_signo;
42 	int si_errno;
43 	int si_code;
44
45 	union {
..
92 	} _sifields;
93 } siginfo_t;

On 64-bit, the int is 4 bytes long, so the three leading members occupy bytes
0 to 11 and offset 12 must be the union member that has not been initialized.
We can verify this using gdb:

	$ gdb vmlinux
	...
	(gdb) p &((struct siginfo *) 0)->_sifields
	$1 = (union {...} *) 0x10

Actually, it seems that the union member is located at offset 0x10 -- which
means that gcc has inserted 4 bytes of padding between the members si_code
and _sifields. We can now get a fuller picture of the memory dump:

         _----------------------------=> si_code
        /        _--------------------=> (padding)
       |        /        _------------=> _sifields(._kill._pid)
       |       |        /        _----=> _sifields(._kill._uid)
       |       |       |        /
-------|-------|-------|-------|
80000000000000000000000000000000000000000088ffff0000000000000000
 i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u

This allows us to realize another important fact: si_code contains the value
0x80. Remember that x86 is little endian, so the first 4 bytes "80000000" are
really the number 0x00000080. With a bit of research, we find that this is
actually the constant SI_KERNEL defined in include/asm-generic/siginfo.h:

144 #define SI_KERNEL	0x80		/* sent by the kernel from somewhere */

This macro is used in exactly one place in the x86 kernel: In send_signal()
in kernel/signal.c:

816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
817 			int group)
818 {
...
828 	pending = group ? &t->signal->shared_pending : &t->pending;
...
851         q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN &&
852                                              (is_si_special(info) ||
853                                               info->si_code >= 0)));
854         if (q) {
855                 list_add_tail(&q->list, &pending->list);
856                 switch ((unsigned long) info) {
...
865                 case (unsigned long) SEND_SIG_PRIV:
866                         q->info.si_signo = sig;
867                         q->info.si_errno = 0;
868                         q->info.si_code = SI_KERNEL;
869                         q->info.si_pid = 0;
870                         q->info.si_uid = 0;
871                         break;
...
890 }

Not only does this match with the .si_code member, it also matches the place
we found earlier when looking for where siginfo_t objects are enqueued on the
"shared_pending" list.

So to sum up: It seems that it is the padding introduced by the compiler
between two struct fields that is uninitialized, and this gets reported when
we do a memcpy() on the struct. This means that we have identified a false
positive warning.

Normally, kmemcheck will not report uninitialized accesses in memcpy() calls
when both the source and destination addresses are tracked. (Instead, we copy
the shadow bytemap as well). In this case, the destination address clearly
was not tracked. We can dig a little deeper into the stack trace from above:

    arch/x86/kernel/signal.c:805
    arch/x86/kernel/signal.c:871
    arch/x86/kernel/entry_64.S:694

And we clearly see that the destination siginfo object is located on the
stack:

782 static void do_signal(struct pt_regs *regs)
783 {
784         struct k_sigaction ka;
785         siginfo_t info;
...
804         signr = get_signal_to_deliver(&info, &ka, regs, NULL);
...
854 }

And this &info is what eventually gets passed to copy_siginfo() as the
destination argument.

Now, even though we didn't find an actual error here, the example is still a
good one, because it shows how one would go about finding out what the report
was all about.


3.4. Annotating false positives
===============================

There are a few different ways to make annotations in the source code that
will keep kmemcheck from checking and reporting certain allocations.
Here they are:

  o __GFP_NOTRACK_FALSE_POSITIVE

    This flag can be passed to kmalloc() or kmem_cache_alloc() (therefore
    also to other functions that end up calling one of these) to indicate
    that the allocation should not be tracked because it would lead to
    a false positive report. This is a "big hammer" way of silencing
    kmemcheck; after all, even if the false positive pertains to a
    particular field in a struct, for example, we will now lose the
    ability to find (real) errors in other parts of the same struct.

    Example:

        /* No warnings will ever trigger on accessing any part of x */
        x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE);

  o kmemcheck_bitfield_begin(name)/kmemcheck_bitfield_end(name) and
    kmemcheck_annotate_bitfield(ptr, name)

    The first two of these three macros can be used inside struct
    definitions to signal, respectively, the beginning and end of a
    bitfield. Additionally, this will assign the bitfield a name, which
    is given as an argument to the macros.

    Having used these markers, one can later use
    kmemcheck_annotate_bitfield() at the point of allocation, to indicate
    which parts of the allocation are part of a bitfield.

    Example:

        struct foo {
                int x;

                kmemcheck_bitfield_begin(flags);
                int flag_a:1;
                int flag_b:1;
                kmemcheck_bitfield_end(flags);

                int y;
        };

        struct foo *x = kmalloc(sizeof *x, GFP_KERNEL);

        /* No warnings will trigger on accessing the bitfield of x */
        kmemcheck_annotate_bitfield(x, flags);

    Note that kmemcheck_annotate_bitfield() can be used even before the
    return value of kmalloc() is checked -- in other words, passing NULL
    as the first argument is legal (and will do nothing).


4. Reporting errors
===================

As we have seen, kmemcheck will produce false positive reports. Therefore, it
is not very wise to blindly post kmemcheck warnings to mailing lists and
maintainers. Instead, I encourage maintainers and developers to find errors
in their own code.
If you get a warning, you can try to work around it, try to figure out if
it's a real error or not, or simply ignore it. Most developers know their own
code and will quickly and efficiently determine the root cause of a kmemcheck
report. This is therefore also the most efficient way to work with kmemcheck.

That said, we (the kmemcheck maintainers) will always be on the lookout for
false positives that we can annotate and silence. So whatever you find,
please drop us a note privately! Kernel configs and steps to reproduce (if
available) are of course a great help too.

Happy hacking!


5. Technical description
========================

kmemcheck works by marking memory pages non-present. This means that whenever
somebody attempts to access the page, a page fault is generated. The page
fault handler notices that the page was in fact only hidden, and so it calls
on the kmemcheck code to make further investigations.

When the investigations are completed, kmemcheck "shows" the page by marking
it present (as it would be under normal circumstances). This way, the
interrupted code can continue as usual.

But after the instruction has been executed, we should hide the page again, so
that we can catch the next access too! Now kmemcheck makes use of a debugging
feature of the processor, namely single-stepping. When the processor has
finished the one instruction that generated the memory access, a debug
exception is raised. From here, we simply hide the page again and continue
execution, this time with the single-stepping feature turned off.

kmemcheck requires some assistance from the memory allocator in order to work.
The memory allocator needs to

  1. Tell kmemcheck about newly allocated pages and pages that are about to
     be freed. This allows kmemcheck to set up and tear down the shadow memory
     for the pages in question. The shadow memory stores the status of each
     byte in the allocation proper, e.g. whether it is initialized or
     uninitialized.

  2.
     Tell kmemcheck which parts of memory should be marked uninitialized.
     There are actually a few more states, such as "not yet allocated" and
     "recently freed".

If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
memory that can take page faults because of kmemcheck.

If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags.
This does not prevent the page faults from occurring, however, but marks the
object in question as being initialized so that no warnings will ever be
produced for this object.

Currently, the SLAB and SLUB allocators are supported by kmemcheck.


Kernel Memory Leak Detector
===========================

Introduction
------------

Kmemleak provides a way of detecting possible kernel memory leaks in a
way similar to a tracing garbage collector
(http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29#Tracing_garbage_collectors),
with the difference that the orphan objects are not freed but only
reported via /sys/kernel/debug/kmemleak. A similar method is used by the
Valgrind tool (memcheck --leak-check) to detect the memory leaks in
user-space applications.

Please check DEBUG_KMEMLEAK dependencies in lib/Kconfig.debug for supported
architectures.

Usage
-----

CONFIG_DEBUG_KMEMLEAK in "Kernel hacking" has to be enabled. A kernel
thread scans the memory every 10 minutes (by default) and prints the
number of new unreferenced objects found. To display the details of all
the possible memory leaks:

    # mount -t debugfs nodev /sys/kernel/debug/
    # cat /sys/kernel/debug/kmemleak

To trigger an intermediate memory scan:

    # echo scan > /sys/kernel/debug/kmemleak

To clear the list of all current possible memory leaks:

    # echo clear > /sys/kernel/debug/kmemleak

New leaks will then come up upon reading /sys/kernel/debug/kmemleak
again.
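
The scan and clear commands can also be issued from a C test harness rather
than from the shell. The following is a minimal userspace sketch, not part of
kmemleak itself: the helper name kmemleak_command() and the KMEMLEAK_PATH
macro are hypothetical, and it assumes that debugfs is mounted on
/sys/kernel/debug and that the caller is allowed to write the file. A newline
is appended to mirror what echo sends.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed location of the control file once debugfs is mounted. */
#define KMEMLEAK_PATH "/sys/kernel/debug/kmemleak"

/*
 * Hypothetical helper: write one command ("scan", "clear", ...) followed
 * by a newline to the file at 'path'.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int kmemleak_command(const char *path, const char *cmd)
{
	FILE *f;
	int ok;

	assert(path != NULL && cmd != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%s\n", cmd) > 0;

	/* Buffered data reaches the file on close, so check fclose() too. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Calling kmemleak_command(KMEMLEAK_PATH, "clear") before a test run and
kmemleak_command(KMEMLEAK_PATH, "scan") afterwards corresponds to the two
echo commands above; the report itself is still obtained by reading the file.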

Note that the orphan objects are listed in the order they were allocated
and one object at the beginning of the list may cause other subsequent
objects to be reported as orphan.

Memory scanning parameters can be modified at run-time by writing to the
/sys/kernel/debug/kmemleak file. The following parameters are supported:

  off        - disable kmemleak (irreversible)
  stack=on   - enable the task stacks scanning (default)
  stack=off  - disable the tasks stacks scanning
  scan=on    - start the automatic memory scanning thread (default)
  scan=off   - stop the automatic memory scanning thread
  scan=<secs> - set the automatic memory scanning period in seconds
               (default 600, 0 to stop the automatic scanning)
  scan       - trigger a memory scan
  clear      - clear list of current memory leak suspects, done by
               marking all current reported unreferenced objects grey
  dump=<addr> - dump information about the object found at <addr>

Kmemleak can also be disabled at boot-time by passing "kmemleak=off" on
the kernel command line.

Memory may be allocated or freed before kmemleak is initialised and
these actions are stored in an early log buffer. The size of this buffer
is configured via the CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE option.

Basic Algorithm
---------------

The memory allocations via kmalloc, vmalloc, kmem_cache_alloc and
friends are traced and the pointers, together with additional
information like size and stack trace, are stored in a prio search tree.
The corresponding freeing function calls are tracked and the pointers
removed from the kmemleak data structures.

An allocated block of memory is considered orphan if no pointer to its
start address or to any location inside the block can be found by
scanning the memory (including saved registers). This means that there
might be no way for the kernel to pass the address of the allocated
block to a freeing function and therefore the block is considered a
memory leak.

The scanning algorithm steps:

  1.
     mark all objects as white (remaining white objects will later be
     considered orphan)
  2. scan the memory starting with the data section and stacks, checking
     the values against the addresses stored in the prio search tree. If
     a pointer to a white object is found, the object is added to the
     gray list
  3. scan the gray objects for matching addresses (some white objects
     can become gray and added at the end of the gray list) until the
     gray set is finished
  4. the remaining white objects are considered orphan and reported via
     /sys/kernel/debug/kmemleak

Some allocated memory blocks have pointers stored in the kernel's
internal data structures and they cannot be detected as orphans. To
avoid this, kmemleak can also store the number of values pointing to an
address inside the block address range that need to be found so that the
block is not considered a leak. One example is __vmalloc().

Testing specific sections with kmemleak
---------------------------------------

Upon initial bootup your /sys/kernel/debug/kmemleak output page may be
quite extensive. This can also be the case if you have very buggy code
when doing development. To work around these situations you can use the
'clear' command to clear all reported unreferenced objects from the
/sys/kernel/debug/kmemleak output. By issuing a 'scan' after a 'clear'
you can find new unreferenced objects; this should help with testing
specific sections of code.

To test a critical section on demand with a clean kmemleak do:

    # echo clear > /sys/kernel/debug/kmemleak
    ... test your kernel or modules ...
    # echo scan > /sys/kernel/debug/kmemleak

Then as usual to get your report with:

    # cat /sys/kernel/debug/kmemleak

Kmemleak API
------------

See the include/linux/kmemleak.h header for the function prototypes.

kmemleak_init            - initialize kmemleak
kmemleak_alloc           - notify of a memory block allocation
kmemleak_alloc_percpu    - notify of a percpu memory block allocation
kmemleak_free            - notify of a memory block freeing
kmemleak_free_part       - notify of a partial memory block freeing
kmemleak_free_percpu     - notify of a percpu memory block freeing
kmemleak_not_leak        - mark an object as not a leak
kmemleak_ignore          - do not scan or report an object as leak
kmemleak_scan_area       - add scan areas inside a memory block
kmemleak_no_scan         - do not scan a memory block
kmemleak_erase           - erase an old value in a pointer variable
kmemleak_alloc_recursive - as kmemleak_alloc but checks the recursiveness
kmemleak_free_recursive  - as kmemleak_free but checks the recursiveness

Dealing with false positives/negatives
--------------------------------------

The false negatives are real memory leaks (orphan objects) but not
reported by kmemleak because values found during the memory scanning
point to such objects. To reduce the number of false negatives, kmemleak
provides the kmemleak_ignore, kmemleak_scan_area, kmemleak_no_scan and
kmemleak_erase functions (see above). The task stacks also increase the
amount of false negatives and their scanning is not enabled by default.

The false positives are objects wrongly reported as being memory leaks
(orphan). For objects known not to be leaks, kmemleak provides the
kmemleak_not_leak function. The kmemleak_ignore could also be used if
the memory block is known not to contain other pointers and it will no
longer be scanned.

Some of the reported leaks are only transient, especially on SMP
systems, because of pointers temporarily stored in CPU registers or
stacks. Kmemleak defines MSECS_MIN_AGE (defaulting to 1000) representing
the minimum age of an object to be reported as a memory leak.

Limitations and Drawbacks
-------------------------

The main drawback is the reduced performance of memory allocation and
freeing.
To avoid other penalties, the memory scanning is only performed
when the /sys/kernel/debug/kmemleak file is read. Anyway, this tool is
intended for debugging purposes where the performance might not be the
most important requirement.

To keep the algorithm simple, kmemleak scans for values pointing to any
address inside a block's address range. This may lead to an increased
number of false negatives. However, it is likely that a real memory leak
will eventually become visible.

Another source of false negatives is the data stored in non-pointer
values. In a future version, kmemleak could only scan the pointer
members in the allocated structures. This feature would solve many of
the false negative cases described above.

The tool can report false positives. These are cases where an allocated
block doesn't need to be freed (some cases in the init_call functions),
the pointer is calculated by other methods than the usual container_of
macro or the pointer is stored in a location not scanned by kmemleak.

Page allocations and ioremap are not tracked.


Everything you never wanted to know about kobjects, ksets, and ktypes

Greg Kroah-Hartman

Based on an original article by Jon Corbet for lwn.net written October 1,
2003 and located at http://lwn.net/Articles/51437/

Last updated December 19, 2007


Part of the difficulty in understanding the driver model - and the kobject
abstraction upon which it is built - is that there is no obvious starting
place. Dealing with kobjects requires understanding a few different types,
all of which make reference to each other. In an attempt to make things
easier, we'll take a multi-pass approach, starting with vague terms and
adding detail as we go. To that end, here are some quick definitions of
some terms we will be working with.

 - A kobject is an object of type struct kobject. Kobjects have a name
   and a reference count.
   A kobject also has a parent pointer (allowing
   objects to be arranged into hierarchies), a specific type, and,
   usually, a representation in the sysfs virtual filesystem.

   Kobjects are generally not interesting on their own; instead, they are
   usually embedded within some other structure which contains the stuff
   the code is really interested in.

   No structure should EVER have more than one kobject embedded within it.
   If it does, the reference counting for the object is sure to be messed
   up and incorrect, and your code will be buggy. So do not do this.

 - A ktype is the type of object that embeds a kobject. Every structure
   that embeds a kobject needs a corresponding ktype. The ktype controls
   what happens to the kobject when it is created and destroyed.

 - A kset is a group of kobjects. These kobjects can be of the same ktype
   or belong to different ktypes. The kset is the basic container type for
   collections of kobjects. Ksets contain their own kobjects, but you can
   safely ignore that implementation detail as the kset core code handles
   this kobject automatically.

   When you see a sysfs directory full of other directories, generally each
   of those directories corresponds to a kobject in the same kset.

We'll look at how to create and manipulate all of these types. A bottom-up
approach will be taken, so we'll go back to kobjects.


Embedding kobjects

It is rare for kernel code to create a standalone kobject, with one major
exception explained below. Instead, kobjects are used to control access to
a larger, domain-specific object. To this end, kobjects will be found
embedded in other structures. If you are used to thinking of things in
object-oriented terms, kobjects can be seen as a top-level, abstract class
from which other classes are derived. A kobject implements a set of
capabilities which are not particularly useful by themselves, but which are
nice to have in other objects.
The C language does not allow for the
direct expression of inheritance, so other techniques - such as structure
embedding - must be used.

(As an aside, for those familiar with the kernel linked list implementation,
this is analogous as to how "list_head" structs are rarely useful on
their own, but are invariably found embedded in the larger objects of
interest.)

So, for example, the UIO code in drivers/uio/uio.c has a structure that
defines the memory region associated with a uio device:

    struct uio_map {
        struct kobject kobj;
        struct uio_mem *mem;
    };

If you have a struct uio_map structure, finding its embedded kobject is
just a matter of using the kobj member. Code that works with kobjects will
often have the opposite problem, however: given a struct kobject pointer,
what is the pointer to the containing structure? You must avoid tricks
(such as assuming that the kobject is at the beginning of the structure)
and, instead, use the container_of() macro, found in <linux/kernel.h>:

    container_of(pointer, type, member)

where:

  * "pointer" is the pointer to the embedded kobject,
  * "type" is the type of the containing structure, and
  * "member" is the name of the structure field to which "pointer" points.

The return value from container_of() is a pointer to the corresponding
container type. So, for example, a pointer "kp" to a struct kobject
embedded *within* a struct uio_map could be converted to a pointer to the
*containing* uio_map structure with:

    struct uio_map *u_map = container_of(kp, struct uio_map, kobj);

For convenience, programmers often define a simple macro for "back-casting"
kobject pointers to the containing type. Exactly this happens in the
earlier drivers/uio/uio.c, as you can see here:

    struct uio_map {
        struct kobject kobj;
        struct uio_mem *mem;
    };

    #define to_map(map) container_of(map, struct uio_map, kobj)

where the macro argument "map" is a pointer to the struct kobject in
question.
That macro is subsequently invoked with:

    struct uio_map *map = to_map(kobj);


Initialization of kobjects

Code which creates a kobject must, of course, initialize that object. Some
of the internal fields are set up with a (mandatory) call to kobject_init():

    void kobject_init(struct kobject *kobj, struct kobj_type *ktype);

The ktype is required for a kobject to be created properly, as every kobject
must have an associated kobj_type. After calling kobject_init(), to
register the kobject with sysfs, the function kobject_add() must be called:

    int kobject_add(struct kobject *kobj, struct kobject *parent,
                    const char *fmt, ...);

This sets up the parent of the kobject and the name for the kobject
properly. If the kobject is to be associated with a specific kset,
kobj->kset must be assigned before calling kobject_add(). If a kset is
associated with a kobject, then the parent for the kobject can be set to
NULL in the call to kobject_add() and then the kobject's parent will be the
kset itself.

As the name of the kobject is set when it is added to the kernel, the name
of the kobject should never be manipulated directly. If you must change
the name of the kobject, call kobject_rename():

    int kobject_rename(struct kobject *kobj, const char *new_name);

kobject_rename does not perform any locking or have a solid notion of
what names are valid so the caller must provide their own sanity checking
and serialization.

There is a function called kobject_set_name() but that is legacy cruft and
is being removed. If your code needs to call this function, it is
incorrect and needs to be fixed.
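
As an aside on the container_of() back-casting idiom described under
"Embedding kobjects": the following is a minimal userspace sketch of how the
conversion works. The struct definitions are stand-ins rather than the
kernel's own, the macro is a simplified version without the type checking
that the kernel's macro performs, and map_from_kobj() is a hypothetical
helper. The kobj member is deliberately not placed first here, to show that
the result does not depend on the kobject sitting at offset zero.

```c
#include <assert.h>
#include <stddef.h>	/* offsetof */

/* Userspace stand-ins; the real definitions live in the kernel headers. */
struct kobject { const char *name; };
struct uio_mem { unsigned long size; };

struct uio_map {
	struct uio_mem *mem;	/* placed first on purpose (assumption) */
	struct kobject kobj;	/* so kobj is NOT at offset 0 */
};

/* Simplified container_of(): step back by the member's offset. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same shape as the to_map() macro quoted from drivers/uio/uio.c. */
#define to_map(map) container_of(map, struct uio_map, kobj)

/* Hypothetical helper: recover the uio_map that embeds 'kobj'. */
static struct uio_map *map_from_kobj(struct kobject *kobj)
{
	assert(kobj != NULL);
	return to_map(kobj);
}
```

Given a struct uio_map m, map_from_kobj(&m.kobj) yields &m even though
&m.kobj and &m are different addresses, which is why the "kobject is at the
beginning of the structure" shortcut is unnecessary.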
To properly access the name of the kobject, use the
function kobject_name():

    const char *kobject_name(const struct kobject * kobj);

There is a helper function to both initialize and add the kobject to the
kernel at the same time, called surprisingly enough kobject_init_and_add():

    int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype,
                             struct kobject *parent, const char *fmt, ...);

The arguments are the same as the individual kobject_init() and
kobject_add() functions described above.


Uevents

After a kobject has been registered with the kobject core, you need to
announce to the world that it has been created. This can be done with a
call to kobject_uevent():

    int kobject_uevent(struct kobject *kobj, enum kobject_action action);

Use the KOBJ_ADD action for when the kobject is first added to the kernel.
This should be done only after any attributes or children of the kobject
have been initialized properly, as userspace will instantly start to look
for them when this call happens.

When the kobject is removed from the kernel (details on how to do that are
below), the uevent for KOBJ_REMOVE will be automatically created by the
kobject core, so the caller does not have to worry about doing that by
hand.


Reference counts

One of the key functions of a kobject is to serve as a reference counter
for the object in which it is embedded. As long as references to the object
exist, the object (and the code which supports it) must continue to exist.
The low-level functions for manipulating a kobject's reference counts are:

    struct kobject *kobject_get(struct kobject *kobj);
    void kobject_put(struct kobject *kobj);

A successful call to kobject_get() will increment the kobject's reference
counter and return the pointer to the kobject.

When a reference is released, the call to kobject_put() will decrement the
reference count and, possibly, free the object.
Note that kobject_init()
sets the reference count to one, so the code which sets up the kobject will
need to do a kobject_put() eventually to release that reference.

Because kobjects are dynamic, they must not be declared statically or on
the stack, but instead, always allocated dynamically. Future versions of
the kernel will contain a run-time check for kobjects that are created
statically and will warn the developer of this improper usage.

If all that you want to use a kobject for is to provide a reference counter
for your structure, please use the struct kref instead; a kobject would be
overkill. For more information on how to use struct kref, please see the
file Documentation/kref.txt in the Linux kernel source tree.


Creating "simple" kobjects

Sometimes all that a developer wants is a way to create a simple directory
in the sysfs hierarchy, and not have to mess with the whole complication of
ksets, show and store functions, and other details. This is the one
exception where a single kobject should be created. To create such an
entry, use the function:

    struct kobject *kobject_create_and_add(char *name, struct kobject *parent);

This function will create a kobject and place it in sysfs in the location
underneath the specified parent kobject. To create simple attributes
associated with this kobject, use:

    int sysfs_create_file(struct kobject *kobj, struct attribute *attr);

or

    int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp);

Both types of attributes used here, with a kobject that has been created
with the kobject_create_and_add(), can be of type kobj_attribute, so no
special custom attribute is needed to be created.

See the example module, samples/kobject/kobject-example.c for an
implementation of a simple kobject and attributes.


ktypes and release methods

One important thing still missing from the discussion is what happens to a
kobject when its reference count reaches zero.
The code which created the
kobject generally does not know when that will happen; if it did, there
would be little point in using a kobject in the first place. Even
predictable object lifecycles become more complicated when sysfs is brought
in as other portions of the kernel can get a reference on any kobject that
is registered in the system.

The end result is that a structure protected by a kobject cannot be freed
before its reference count goes to zero. The reference count is not under
the direct control of the code which created the kobject. So that code must
be notified asynchronously whenever the last reference to one of its
kobjects goes away.

Once you registered your kobject via kobject_add(), you must never use
kfree() to free it directly. The only safe way is to use kobject_put(). It
is good practice to always use kobject_put() after kobject_init() to avoid
errors creeping in.

This notification is done through a kobject's release() method. Usually
such a method has a form like:

    void my_object_release(struct kobject *kobj)
    {
            struct my_object *mine = container_of(kobj, struct my_object, kobj);

            /* Perform any additional cleanup on this object, then... */
            kfree(mine);
    }

One important point cannot be overstated: every kobject must have a
release() method, and the kobject must persist (in a consistent state)
until that method is called. If these constraints are not met, the code is
flawed. Note that the kernel will warn you if you forget to provide a
release() method. Do not try to get rid of this warning by providing an
"empty" release function; you will be mocked mercilessly by the kobject
maintainer if you attempt this.

Note, the name of the kobject is available in the release function, but it
must NOT be changed within this callback. Otherwise there will be a memory
leak in the kobject core, which makes people unhappy.

Interestingly, the release() method is not stored in the kobject itself;
instead, it is associated with the ktype.
So let us introduce struct
kobj_type:

    struct kobj_type {
            void (*release)(struct kobject *);
            const struct sysfs_ops *sysfs_ops;
            struct attribute **default_attrs;
    };

This structure is used to describe a particular type of kobject (or, more
correctly, of containing object). Every kobject needs to have an associated
kobj_type structure; a pointer to that structure must be specified when you
call kobject_init() or kobject_init_and_add().

The release field in struct kobj_type is, of course, a pointer to the
release() method for this type of kobject. The other two fields (sysfs_ops
and default_attrs) control how objects of this type are represented in
sysfs; they are beyond the scope of this document.

The default_attrs pointer is a list of default attributes that will be
automatically created for any kobject that is registered with this ktype.


ksets

A kset is merely a collection of kobjects that want to be associated with
each other. There is no restriction that they be of the same ktype, but be
very careful if they are not.

A kset serves these functions:

 - It serves as a bag containing a group of objects. A kset can be used by
   the kernel to track "all block devices" or "all PCI device drivers."

 - A kset is also a subdirectory in sysfs, where the associated kobjects
   with the kset can show up. Every kset contains a kobject which can be
   set up to be the parent of other kobjects; the top-level directories of
   the sysfs hierarchy are constructed in this way.

 - Ksets can support the "hotplugging" of kobjects and influence how
   uevent events are reported to user space.

In object-oriented terms, "kset" is the top-level container class; ksets
contain their own kobject, but that kobject is managed by the kset code and
should not be manipulated by any other user.

A kset keeps its children in a standard kernel linked list. Kobjects point
back to their containing kset via their kset field.
In almost all cases,
the kobjects belonging to a kset have that kset (or, strictly, its embedded
kobject) in their parent.

As a kset contains a kobject within it, it should always be dynamically
created and never declared statically or on the stack. To create a new
kset use:

    struct kset *kset_create_and_add(const char *name,
                                     struct kset_uevent_ops *u,
                                     struct kobject *parent);

When you are finished with the kset, call:

    void kset_unregister(struct kset *kset);

to destroy it.

An example of using a kset can be seen in the
samples/kobject/kset-example.c file in the kernel tree.

If a kset wishes to control the uevent operations of the kobjects
associated with it, it can use the struct kset_uevent_ops to handle it:

    struct kset_uevent_ops {
            int (*filter)(struct kset *kset, struct kobject *kobj);
            const char *(*name)(struct kset *kset, struct kobject *kobj);
            int (*uevent)(struct kset *kset, struct kobject *kobj,
                          struct kobj_uevent_env *env);
    };

The filter function allows a kset to prevent a uevent from being emitted to
userspace for a specific kobject. If the function returns 0, the uevent
will not be emitted.

The name function will be called to override the default name of the kset
that the uevent sends to userspace. By default, the name will be the same
as the kset itself, but this function, if present, can override that name.

The uevent function will be called when the uevent is about to be sent to
userspace to allow more environment variables to be added to the uevent.

One might ask how, exactly, a kobject is added to a kset, given that no
functions which perform that function have been presented. The answer is
that this task is handled by kobject_add(). When a kobject is passed to
kobject_add(), its kset member should point to the kset to which the
kobject will belong. kobject_add() will handle the rest.

If the kobject belonging to a kset has no parent kobject set, it will be
added to the kset's directory.
Not all members of a kset necessarily
live in the kset directory. If an explicit parent kobject is assigned
before the kobject is added, the kobject is registered with the kset, but
added below the parent kobject.


Kobject removal

After a kobject has been registered with the kobject core successfully, it
must be cleaned up when the code is finished with it. To do that, call
kobject_put(). By doing this, the kobject core will automatically clean up
all of the memory allocated by this kobject. If a KOBJ_ADD uevent has been
sent for the object, a corresponding KOBJ_REMOVE uevent will be sent, and
any other sysfs housekeeping will be handled for the caller properly.

If you need to do a two-stage delete of the kobject (say you are not
allowed to sleep when you need to destroy the object), then call
kobject_del() which will unregister the kobject from sysfs. This makes the
kobject "invisible", but it is not cleaned up, and the reference count of
the object is still the same. At a later time call kobject_put() to finish
the cleanup of the memory associated with the kobject.

kobject_del() can be used to drop the reference to the parent object, if
circular references are constructed. It is valid in some cases for a
parent object to reference a child. Circular references _must_ be broken
with an explicit call to kobject_del(), so that the release functions will
be called, and the objects in the former circle release each other.


Example code to copy from

For a more complete example of using ksets and kobjects properly, see the
example programs samples/kobject/{kobject-example.c,kset-example.c},
which will be built as loadable modules if you select CONFIG_SAMPLE_KOBJECT.


Title   : Kernel Probes (Kprobes)
Authors : Jim Keniston
        : Prasanna S Panchamukhi
        : Masami Hiramatsu

CONTENTS

1. Concepts: Kprobes, Jprobes, Return Probes
2. Architectures Supported
3. Configuring Kprobes
4. API Reference
5. Kprobes Features and Limitations
6. Probe Overhead
7. TODO
8. Kprobes Example
9.
Jprobes Example 10. Kretprobes Example Appendix A: The kprobes debugfs interface Appendix B: The kprobes sysctl interface 1. Concepts: Kprobes, Jprobes, Return Probes Kprobes enables you to dynamically break into any kernel routine and collect debugging and performance information non-disruptively. You can trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit. There are currently three types of probes: kprobes, jprobes, and kretprobes (also called return probes). A kprobe can be inserted on virtually any instruction in the kernel. A jprobe is inserted at the entry to a kernel function, and provides convenient access to the function's arguments. A return probe fires when a specified function returns. In the typical case, Kprobes-based instrumentation is packaged as a kernel module. The module's init function installs ("registers") one or more probes, and the exit function unregisters them. A registration function such as register_kprobe() specifies where the probe is to be inserted and what handler is to be called when the probe is hit. There are also register_/unregister_*probes() functions for batch registration/unregistration of a group of *probes. These functions can speed up unregistration process when you have to unregister a lot of probes at once. The next four subsections explain how the different types of probes work and how jump optimization works. They explain certain things that you'll need to know in order to make the best use of Kprobes -- e.g., the difference between a pre_handler and a post_handler, and how to use the maxactive and nmissed fields of a kretprobe. But if you're in a hurry to start using Kprobes, you can skip ahead to section 2. 1.1 How Does a Kprobe Work? When a kprobe is registered, Kprobes makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction with a breakpoint instruction (e.g., int3 on i386 and x86_64). 
When a CPU hits the breakpoint instruction, a trap occurs, the CPU's registers are saved, and control passes to Kprobes via the notifier_call_chain mechanism. Kprobes executes the "pre_handler" associated with the kprobe, passing the handler the addresses of the kprobe struct and the saved registers. Next, Kprobes single-steps its copy of the probed instruction. (It would be simpler to single-step the actual instruction in place, but then Kprobes would have to temporarily remove the breakpoint instruction. This would open a small time window when another CPU could sail right past the probepoint.) After the instruction is single-stepped, Kprobes executes the "post_handler," if any, that is associated with the kprobe. Execution then continues with the instruction following the probepoint. 1.2 How Does a Jprobe Work? A jprobe is implemented using a kprobe that is placed on a function's entry point. It employs a simple mirroring principle to allow seamless access to the probed function's arguments. The jprobe handler routine should have the same signature (arg list and return type) as the function being probed, and must always end by calling the Kprobes function jprobe_return(). Here's how it works. When the probe is hit, Kprobes makes a copy of the saved registers and a generous portion of the stack (see below). Kprobes then points the saved instruction pointer at the jprobe's handler routine, and returns from the trap. As a result, control passes to the handler, which is presented with the same register and stack contents as the probed function. When it is done, the handler calls jprobe_return(), which traps again to restore the original stack contents and processor state and switch to the probed function. By convention, the callee owns its arguments, so gcc may produce code that unexpectedly modifies that portion of the stack. This is why Kprobes saves a copy of the stack and restores it after the jprobe handler has run. 
Up to MAX_STACK_SIZE bytes are copied -- e.g., 64 bytes on i386. Note that the probed function's args may be passed on the stack or in registers. The jprobe will work in either case, so long as the handler's prototype matches that of the probed function. 1.3 Return Probes 1.3.1 How Does a Return Probe Work? When you call register_kretprobe(), Kprobes establishes a kprobe at the entry to the function. When the probed function is called and this probe is hit, Kprobes saves a copy of the return address, and replaces the return address with the address of a "trampoline." The trampoline is an arbitrary piece of code -- typically just a nop instruction. At boot time, Kprobes registers a kprobe at the trampoline. When the probed function executes its return instruction, control passes to the trampoline and that probe is hit. Kprobes' trampoline handler calls the user-specified return handler associated with the kretprobe, then sets the saved instruction pointer to the saved return address, and that's where execution resumes upon return from the trap. While the probed function is executing, its return address is stored in an object of type kretprobe_instance. Before calling register_kretprobe(), the user sets the maxactive field of the kretprobe struct to specify how many instances of the specified function can be probed simultaneously. register_kretprobe() pre-allocates the indicated number of kretprobe_instance objects. For example, if the function is non-recursive and is called with a spinlock held, maxactive = 1 should be enough. If the function is non-recursive and can never relinquish the CPU (e.g., via a semaphore or preemption), NR_CPUS should be enough. If maxactive <= 0, it is set to a default value. If CONFIG_PREEMPT is enabled, the default is max(10, 2*NR_CPUS). Otherwise, the default is NR_CPUS. It's not a disaster if you set maxactive too low; you'll just miss some probes. 
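As an illustration only, the defaulting rule for maxactive described above can be written as a small user-space C function. This is a sketch, not the kernel's code: effective_maxactive() is a hypothetical name, and NR_CPUS and the CONFIG_PREEMPT setting are passed in as ordinary parameters rather than taken from the kernel configuration.

```c
#include <assert.h>

/*
 * Hypothetical user-space model of the maxactive defaulting rule.
 * nr_cpus stands in for NR_CPUS, preempt for CONFIG_PREEMPT.
 */
int effective_maxactive(int maxactive, int nr_cpus, int preempt)
{
	if (maxactive > 0)
		return maxactive;		/* the caller's value is kept */

	if (preempt) {
		int twice = 2 * nr_cpus;

		return twice > 10 ? twice : 10;	/* max(10, 2*NR_CPUS) */
	}

	return nr_cpus;				/* non-preemptible default */
}
```

With these assumptions, a 4-CPU preemptible configuration that leaves maxactive at 0 would end up with 10 instances, while the same machine without preemption would get 4.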
In the kretprobe struct, the nmissed field is set to zero when the return probe is registered, and is incremented every time the probed function is entered but there is no kretprobe_instance object available for establishing the return probe. 1.3.2 Kretprobe entry-handler Kretprobes also provides an optional user-specified handler which runs on function entry. This handler is specified by setting the entry_handler field of the kretprobe struct. Whenever the kprobe placed by kretprobe at the function entry is hit, the user-defined entry_handler, if any, is invoked. If the entry_handler returns 0 (success) then a corresponding return handler is guaranteed to be called upon function return. If the entry_handler returns a non-zero error then Kprobes leaves the return address as is, and the kretprobe has no further effect for that particular function instance. Multiple entry and return handler invocations are matched using the unique kretprobe_instance object associated with them. Additionally, a user may also specify per return-instance private data to be part of each kretprobe_instance object. This is especially useful when sharing private data between corresponding user entry and return handlers. The size of each private data object can be specified at kretprobe registration time by setting the data_size field of the kretprobe struct. This data can be accessed through the data field of each kretprobe_instance object. In case probed function is entered but there is no kretprobe_instance object available, then in addition to incrementing the nmissed count, the user entry_handler invocation is also skipped. 1.4 How Does Jump Optimization Work? 
If your kernel is built with CONFIG_OPTPROBES=y (currently this flag is automatically set 'y' on x86/x86-64, non-preemptive kernel) and the "debug.kprobes_optimization" kernel parameter is set to 1 (see sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump instruction instead of a breakpoint instruction at each probepoint. 1.4.1 Init a Kprobe When a probe is registered, before attempting this optimization, Kprobes inserts an ordinary, breakpoint-based kprobe at the specified address. So, even if it's not possible to optimize this particular probepoint, there'll be a probe there. 1.4.2 Safety Check Before optimizing a probe, Kprobes performs the following safety checks: - Kprobes verifies that the region that will be replaced by the jump instruction (the "optimized region") lies entirely within one function. (A jump instruction is multiple bytes, and so may overlay multiple instructions.) - Kprobes analyzes the entire function and verifies that there is no jump into the optimized region. Specifically: - the function contains no indirect jump; - the function contains no instruction that causes an exception (since the fixup code triggered by the exception could jump back into the optimized region -- Kprobes checks the exception tables to verify this); and - there is no near jump to the optimized region (other than to the first byte). - For each instruction in the optimized region, Kprobes verifies that the instruction can be executed out of line. 1.4.3 Preparing Detour Buffer Next, Kprobes prepares a "detour" buffer, which contains the following instruction sequence: - code to push the CPU's registers (emulating a breakpoint trap) - a call to the trampoline code which calls user's probe handlers. - code to restore registers - the instructions from the optimized region - a jump back to the original execution path. 
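The address-range part of the safety checks in 1.4.2 can be sketched in user-space C as follows. This is a simplified model under stated assumptions, not the kernel's implementation: struct func_info and can_optimize() are hypothetical names, the list of near-jump targets is assumed to have been collected by a decoder elsewhere, and the exception-table and out-of-line-execution checks are left out. The 5-byte jump length is the x86 figure quoted in section 5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define JUMP_LEN 5	/* size of the jump on x86, per section 5 */

/* Hypothetical summary of an already-decoded function. */
struct func_info {
	unsigned long start;			/* first byte of the function */
	unsigned long end;			/* one past the last byte */
	bool has_indirect_jump;			/* any indirect jump found? */
	const unsigned long *jump_targets;	/* targets of near jumps */
	size_t nr_targets;
};

/*
 * region_len is the number of bytes, made up of whole instructions,
 * that the jump would overwrite starting at probe_addr.
 */
bool can_optimize(const struct func_info *f, unsigned long probe_addr,
		  unsigned long region_len)
{
	unsigned long region_end = probe_addr + region_len;
	size_t i;

	if (region_len < JUMP_LEN)
		return false;	/* not enough room for the jump */
	if (probe_addr < f->start || region_end > f->end)
		return false;	/* region must lie within one function */
	if (f->has_indirect_jump)
		return false;	/* targets cannot be proven safe */

	for (i = 0; i < f->nr_targets; i++) {
		unsigned long t = f->jump_targets[i];

		/* A jump to the first byte is fine; anywhere else is not. */
		if (t > probe_addr && t < region_end)
			return false;
	}
	return true;
}
```

A probe that fails this kind of check is not lost: as 1.4.1 explains, the ordinary breakpoint-based kprobe is already in place and simply stays unoptimized.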
1.4.4 Pre-optimization After preparing the detour buffer, Kprobes verifies that none of the following situations exist: - The probe has either a break_handler (i.e., it's a jprobe) or a post_handler. - Other instructions in the optimized region are probed. - The probe is disabled. In any of the above cases, Kprobes won't start optimizing the probe. Since these are temporary situations, Kprobes tries to start optimizing it again if the situation is changed. If the kprobe can be optimized, Kprobes enqueues the kprobe to an optimizing list, and kicks the kprobe-optimizer workqueue to optimize it. If the to-be-optimized probepoint is hit before being optimized, Kprobes returns control to the original instruction path by setting the CPU's instruction pointer to the copied code in the detour buffer -- thus at least avoiding the single-step. 1.4.5 Optimization The Kprobe-optimizer doesn't insert the jump instruction immediately; rather, it calls synchronize_sched() for safety first, because it's possible for a CPU to be interrupted in the middle of executing the optimized region(*). As you know, synchronize_sched() can ensure that all interruptions that were active when synchronize_sched() was called are done, but only if CONFIG_PREEMPT=n. So, this version of kprobe optimization supports only kernels with CONFIG_PREEMPT=n.(**) After that, the Kprobe-optimizer calls stop_machine() to replace the optimized region with a jump instruction to the detour buffer, using text_poke_smp(). 1.4.6 Unoptimization When an optimized kprobe is unregistered, disabled, or blocked by another kprobe, it will be unoptimized. If this happens before the optimization is complete, the kprobe is just dequeued from the optimized list. If the optimization has been done, the jump is replaced with the original code (except for an int3 breakpoint in the first byte) by using text_poke_smp(). 
(*) Imagine that the 2nd instruction is interrupted and then the optimizer replaces the 2nd instruction with the jump *address* while the interrupt handler is running. When the interrupt returns to the original address, there is no valid instruction there, and this causes an unexpected result.

(**) This optimization-safety checking may be replaced with the stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y kernel.

NOTE for geeks:
The jump optimization changes the kprobe's pre_handler behavior. Without optimization, the pre_handler can change the kernel's execution path by changing regs->ip and returning 1. However, when the probe is optimized, that modification is ignored. Thus, if you want to tweak the kernel's execution path, you need to suppress optimization, using one of the following techniques:
- Specify an empty function for the kprobe's post_handler or break_handler.
 or
- Execute 'sysctl -w debug.kprobes_optimization=0'

2. Architectures Supported

Kprobes, jprobes, and return probes are implemented on the following architectures:

- i386 (Supports jump optimization)
- x86_64 (AMD-64, EM64T) (Supports jump optimization)
- ppc64
- ia64 (Does not support probes on instruction slot1.)
- sparc64 (Return probes not yet implemented.)
- arm
- ppc
- mips

3. Configuring Kprobes

When configuring the kernel using make menuconfig/xconfig/oldconfig, ensure that CONFIG_KPROBES is set to "y". Under "Instrumentation Support", look for "Kprobes".

So that you can load and unload Kprobes-based instrumentation modules, make sure "Loadable module support" (CONFIG_MODULES) and "Module unloading" (CONFIG_MODULE_UNLOAD) are set to "y".

Also make sure that CONFIG_KALLSYMS and perhaps even CONFIG_KALLSYMS_ALL are set to "y", since kallsyms_lookup_name() is used by the in-kernel kprobe address resolution code.
If you need to insert a probe in the middle of a function, you may find it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO), so you can use "objdump -d -l vmlinux" to see the source-to-object code mapping.

4. API Reference

The Kprobes API includes a "register" function and an "unregister" function for each type of probe. The API also includes "register_*probes" and "unregister_*probes" functions for (un)registering arrays of probes. Here are terse, mini-man-page specifications for these functions and the associated probe handlers that you'll write. See the files in the samples/kprobes/ sub-directory for examples.

4.1 register_kprobe

#include <linux/kprobes.h>
int register_kprobe(struct kprobe *kp);

Sets a breakpoint at the address kp->addr. When the breakpoint is hit, Kprobes calls kp->pre_handler. After the probed instruction is single-stepped, Kprobes calls kp->post_handler. If a fault occurs during execution of kp->pre_handler or kp->post_handler, or during single-stepping of the probed instruction, Kprobes calls kp->fault_handler. Any or all handlers can be NULL. If KPROBE_FLAG_DISABLED is set in kp->flags, that kp will be registered but disabled, so its handlers aren't hit until enable_kprobe(kp) is called.

NOTE:
1. With the introduction of the "symbol_name" field to struct kprobe, the probepoint address resolution will now be taken care of by the kernel. The following will now work:

	kp.symbol_name = "symbol_name";

(64-bit powerpc intricacies such as function descriptors are handled transparently)

2. Use the "offset" field of struct kprobe if the offset into the symbol to install a probepoint is known. This field is used to calculate the probepoint.

3. Specify either the kprobe "symbol_name" OR the "addr". If both are specified, kprobe registration will fail with -EINVAL.

4. With CISC architectures (such as i386 and x86_64), the kprobes code does not validate whether kprobe.addr is at an instruction boundary. Use "offset" with caution.
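Notes 1 to 3 above can be illustrated with a small user-space model of the address resolution. Everything here is a sketch under assumptions, not kernel code: struct probe_spec is a cut-down stand-in for the relevant fields of struct kprobe, symtab[] and lookup_symbol() are hypothetical replacements for the kernel symbol lookup (the two addresses are borrowed from the sample listing in Appendix A), and the handling of "neither field set" and "unknown symbol" is an assumption that the text above does not spell out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel symbol table. */
struct sym {
	const char *name;
	unsigned long addr;
};

static const struct sym symtab[] = {
	{ "vfs_read", 0xc015d71aUL },
	{ "do_fork",  0xc011a316UL },
};

static unsigned long lookup_symbol(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++)
		if (strcmp(symtab[i].name, name) == 0)
			return symtab[i].addr;
	return 0;	/* not found */
}

/* Simplified stand-in for the fields of struct kprobe discussed above. */
struct probe_spec {
	const char *symbol_name;
	unsigned long addr;
	unsigned int offset;
};

/* Returns 0 and stores the probepoint in *out, or a negative errno. */
int resolve_probepoint(const struct probe_spec *p, unsigned long *out)
{
	unsigned long base;

	if (p->symbol_name && p->addr)
		return -EINVAL;		/* note 3: both were specified */
	if (!p->symbol_name && !p->addr)
		return -EINVAL;		/* assumption: nothing to resolve */

	if (p->symbol_name) {
		base = lookup_symbol(p->symbol_name);
		if (!base)
			return -ENOENT;	/* assumption: unknown symbol */
	} else {
		base = p->addr;
	}

	*out = base + p->offset;	/* note 2: offset selects the probepoint */
	return 0;
}
```

In this model, asking for "vfs_read" with an offset of 0x10 yields the symbol's address plus 0x10; per note 4, nothing checks that the result lands on an instruction boundary.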
register_kprobe() returns 0 on success, or a negative errno otherwise.

User's pre-handler (kp->pre_handler):
#include <linux/kprobes.h>
#include <linux/ptrace.h>
int pre_handler(struct kprobe *p, struct pt_regs *regs);

Called with p pointing to the kprobe associated with the breakpoint, and regs pointing to the struct containing the registers saved when the breakpoint was hit. Return 0 here unless you're a Kprobes geek.

User's post-handler (kp->post_handler):
#include <linux/kprobes.h>
#include <linux/ptrace.h>
void post_handler(struct kprobe *p, struct pt_regs *regs, unsigned long flags);

p and regs are as described for the pre_handler. flags always seems to be zero.

User's fault-handler (kp->fault_handler):
#include <linux/kprobes.h>
#include <linux/ptrace.h>
int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr);

p and regs are as described for the pre_handler. trapnr is the architecture-specific trap number associated with the fault (e.g., on i386, 13 for a general protection fault or 14 for a page fault). Returns 1 if it successfully handled the exception.

4.2 register_jprobe

#include <linux/kprobes.h>
int register_jprobe(struct jprobe *jp);

Sets a breakpoint at the address jp->kp.addr, which must be the address of the first instruction of a function. When the breakpoint is hit, Kprobes runs the handler whose address is jp->entry.

The handler should have the same arg list and return type as the probed function; and just before it returns, it must call jprobe_return(). (The handler never actually returns, since jprobe_return() returns control to Kprobes.) If the probed function is declared asmlinkage or anything else that affects how args are passed, the handler's declaration must match.

register_jprobe() returns 0 on success, or a negative errno otherwise.

4.3 register_kretprobe

#include <linux/kprobes.h>
int register_kretprobe(struct kretprobe *rp);

Establishes a return probe for the function whose address is rp->kp.addr. When that function returns, Kprobes calls rp->handler.
You must set rp->maxactive appropriately before you call register_kretprobe(); see "How Does a Return Probe Work?" for details.

register_kretprobe() returns 0 on success, or a negative errno otherwise.

User's return-probe handler (rp->handler):
#include <linux/kprobes.h>
#include <linux/ptrace.h>
int kretprobe_handler(struct kretprobe_instance *ri, struct pt_regs *regs);

regs is as described for kprobe.pre_handler. ri points to the kretprobe_instance object, of which the following fields may be of interest:
- ret_addr: the return address
- rp: points to the corresponding kretprobe object
- task: points to the corresponding task struct
- data: points to per return-instance private data; see "Kretprobe entry-handler" for details.

The regs_return_value(regs) macro provides a simple abstraction to extract the return value from the appropriate register as defined by the architecture's ABI.

The handler's return value is currently ignored.

4.4 unregister_*probe

#include <linux/kprobes.h>
void unregister_kprobe(struct kprobe *kp);
void unregister_jprobe(struct jprobe *jp);
void unregister_kretprobe(struct kretprobe *rp);

Removes the specified probe. The unregister function can be called at any time after the probe has been registered.

NOTE:
If the functions find an incorrect probe (e.g., an unregistered probe), they clear the addr field of the probe.

4.5 register_*probes

#include <linux/kprobes.h>
int register_kprobes(struct kprobe **kps, int num);
int register_kretprobes(struct kretprobe **rps, int num);
int register_jprobes(struct jprobe **jps, int num);

Registers each of the num probes in the specified array. If any error occurs during registration, all probes in the array, up to the bad probe, are safely unregistered before the register_*probes function returns.
- kps/rps/jps: an array of pointers to *probe data structures
- num: the number of the array entries.

NOTE:
You have to allocate (or define) an array of pointers and set all of the array entries before using these functions.
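The all-or-nothing behaviour described for register_*probes can be sketched in user-space C. This is a model of the rollback pattern only, not the kernel's implementation: struct fake_probe, register_one(), unregister_one() and register_all() are hypothetical names, and rejecting num <= 0 with -EINVAL is an assumption made for the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a probe; 'bad' forces registration to fail. */
struct fake_probe {
	int bad;
	int registered;
};

static int register_one(struct fake_probe *p)
{
	if (p->bad)
		return -EINVAL;
	p->registered = 1;
	return 0;
}

static void unregister_one(struct fake_probe *p)
{
	p->registered = 0;
}

/* Registers all num probes, or none of them if any registration fails. */
int register_all(struct fake_probe **ps, int num)
{
	int i, ret;

	if (num <= 0)
		return -EINVAL;		/* assumption: empty arrays are rejected */

	for (i = 0; i < num; i++) {
		ret = register_one(ps[i]);
		if (ret < 0) {
			/* Undo every probe registered before the bad one. */
			while (--i >= 0)
				unregister_one(ps[i]);
			return ret;
		}
	}
	return 0;
}
```

The caller passes an array of pointers, as the NOTE above requires; after a failure no entry of the array is left registered, so the whole call can simply be retried once the bad probe is fixed.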
4.6 unregister_*probes

#include <linux/kprobes.h>
void unregister_kprobes(struct kprobe **kps, int num);
void unregister_kretprobes(struct kretprobe **rps, int num);
void unregister_jprobes(struct jprobe **jps, int num);

Removes each of the num probes in the specified array at once.

NOTE:
If the functions find some incorrect probes (e.g., unregistered probes) in the specified array, they clear the addr field of those incorrect probes. However, other probes in the array are unregistered correctly.

4.7 disable_*probe

#include <linux/kprobes.h>
int disable_kprobe(struct kprobe *kp);
int disable_kretprobe(struct kretprobe *rp);
int disable_jprobe(struct jprobe *jp);

Temporarily disables the specified *probe. You can enable it again by using enable_*probe(). You must specify a probe that has been registered.

4.8 enable_*probe

#include <linux/kprobes.h>
int enable_kprobe(struct kprobe *kp);
int enable_kretprobe(struct kretprobe *rp);
int enable_jprobe(struct jprobe *jp);

Enables a *probe that has been disabled by disable_*probe(). You must specify a probe that has been registered.

5. Kprobes Features and Limitations

Kprobes allows multiple probes at the same address. Currently, however, there cannot be multiple jprobes on the same function at the same time. Also, a probepoint for which there is a jprobe or a post_handler cannot be optimized. So if you install a jprobe, or a kprobe with a post_handler, at an optimized probepoint, the probepoint will be unoptimized automatically.

In general, you can install a probe anywhere in the kernel. In particular, you can probe interrupt handlers. Known exceptions are discussed in this section.

The register_*probe functions will return -EINVAL if you attempt to install a probe in the code that implements Kprobes (mostly kernel/kprobes.c and arch/*/kernel/kprobes.c, but also functions such as do_page_fault and notifier_call_chain).

If you install a probe in an inline-able function, Kprobes makes no attempt to chase down all inline instances of the function and install probes there.
gcc may inline a function without being asked, so keep this in mind if you're not seeing the probe hits you expect. A probe handler can modify the environment of the probed function -- e.g., by modifying kernel data structures, or by modifying the contents of the pt_regs struct (which are restored to the registers upon return from the breakpoint). So Kprobes can be used, for example, to install a bug fix or to inject faults for testing. Kprobes, of course, has no way to distinguish the deliberately injected faults from the accidental ones. Don't drink and probe. Kprobes makes no attempt to prevent probe handlers from stepping on each other -- e.g., probing printk() and then calling printk() from a probe handler. If a probe handler hits a probe, that second probe's handlers won't be run in that instance, and the kprobe.nmissed member of the second probe will be incremented. As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of the same handler) may run concurrently on different CPUs. Kprobes does not use mutexes or allocate memory except during registration and unregistration. Probe handlers are run with preemption disabled. Depending on the architecture and optimization state, handlers may also run with interrupts disabled (e.g., kretprobe handlers and optimized kprobe handlers run without interrupt disabled on x86/x86-64). In any case, your handler should not yield the CPU (e.g., by attempting to acquire a semaphore). Since a return probe is implemented by replacing the return address with the trampoline's address, stack backtraces and calls to __builtin_return_address() will typically yield the trampoline's address instead of the real return address for kretprobed functions. (As far as we can tell, __builtin_return_address() is used only for instrumentation and error reporting.) If the number of times a function is called does not match the number of times it returns, registering a return probe on that function may produce undesirable results. 
In such a case, a line:
kretprobe BUG!: Processing kretprobe d000000000041aa8 @ c00000000004f48c
gets printed. With this information, one will be able to correlate the exact instance of the kretprobe that caused the problem. We have the do_exit() case covered. do_execve() and do_fork() are not an issue. We're unaware of other specific cases where this could be a problem.

If, upon entry to or exit from a function, the CPU is running on a stack other than that of the current task, registering a return probe on that function may produce undesirable results. For this reason, Kprobes doesn't support return probes (or kprobes or jprobes) on the x86_64 version of __switch_to(); the registration functions return -EINVAL.

On x86/x86-64, since the Jump Optimization of Kprobes modifies instructions widely, there are some limitations to optimization. To explain it, we introduce some terminology. Imagine a 3-instruction sequence consisting of two 2-byte instructions and one 3-byte instruction.

            IA
             |
    [-2][-1][0][1][2][3][4][5][6][7]
            [ins1][ins2][ ins3  ]
            [<-      DCR      ->]
               [<- JTPR ->]

ins1: 1st Instruction
ins2: 2nd Instruction
ins3: 3rd Instruction
IA:  Insertion Address
JTPR: Jump Target Prohibition Region
DCR: Detoured Code Region

The instructions in DCR are copied to the out-of-line buffer of the kprobe, because the bytes in DCR are replaced by a 5-byte jump instruction. So there are several limitations.

a) The instructions in DCR must be relocatable.
b) The instructions in DCR must not include a call instruction.
c) JTPR must not be targeted by any jump or call instruction.
d) DCR must not straddle the border between functions.

Anyway, these limitations are checked by the in-kernel instruction decoder, so you don't need to worry about that.

6. Probe Overhead

On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0 microseconds to process.
Specifically, a benchmark that hits the same probepoint repeatedly, firing a simple handler each time, reports 1-2 million hits per second, depending on the architecture. A jprobe or return-probe hit typically takes 50-75% longer than a kprobe hit. When you have a return probe set on a function, adding a kprobe at the entry to that function adds essentially no overhead. Here are sample overhead figures (in usec) for different architectures. k = kprobe; j = jprobe; r = return probe; kr = kprobe + return probe on same function; jr = jprobe + return probe on same function i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips k = 0.57 usec; j = 1.00; r = 0.92; kr = 0.99; jr = 1.40 x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07 ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU) k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99 6.1 Optimized Probe Overhead Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to process. Here are sample overhead figures (in usec) for x86 architectures. k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe, r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe. i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33 x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30 7. TODO a. SystemTap (http://sourceware.org/systemtap): Provides a simplified programming interface for probe-based instrumentation. Try it out. b. Kernel return probes for sparc64. c. Support for other architectures. d. User-space probes. e. Watchpoint probes (which fire on data references). 8. Kprobes Example See samples/kprobes/kprobe_example.c 9. Jprobes Example See samples/kprobes/jprobe_example.c 10. 
Kretprobes Example

See samples/kprobes/kretprobe_example.c

For additional information on Kprobes, refer to the following URLs:
http://www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe
http://www.redhat.com/magazine/005mar05/features/kprobes/
http://www-users.cs.umn.edu/~boutcher/kprobes/
http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf (pages 101-115)


Appendix A: The kprobes debugfs interface

With recent kernels (> 2.6.20) the list of registered kprobes is visible under the /sys/kernel/debug/kprobes/ directory (assuming debugfs is mounted at /sys/kernel/debug).

/sys/kernel/debug/kprobes/list: Lists all registered probes on the system

c015d71a  k  vfs_read+0x0
c011a316  j  do_fork+0x0
c03dedc5  r  tcp_v4_rcv+0x0

The first column provides the kernel address where the probe is inserted. The second column identifies the type of probe (k - kprobe, r - kretprobe and j - jprobe), while the third column specifies the symbol+offset of the probe. If the probed function belongs to a module, the module name is also specified. The following columns show the probe status. If the probe is on a virtual address that is no longer valid (module init sections, module virtual addresses that correspond to modules that have been unloaded), such probes are marked with [GONE]. If the probe is temporarily disabled, such probes are marked with [DISABLED]. If the probe is optimized, it is marked with [OPTIMIZED].

/sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly.

Provides a knob to globally and forcibly turn registered kprobes ON or OFF. By default, all kprobes are enabled. By echoing "0" to this file, all registered probes will be disarmed, until a "1" is echoed to this file. Note that this knob just disarms and arms all kprobes and doesn't change each probe's disabling state. This means that disabled kprobes (marked [DISABLED]) will not be enabled if you turn ON all kprobes by this knob.
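A tool that reads /sys/kernel/debug/kprobes/list might split each line into the columns described above. The following user-space C fragment is a minimal sketch of that idea, assuming the whitespace-separated layout of the sample listing and that the status markers appear literally as [GONE], [DISABLED] and [OPTIMIZED]. struct probe_entry and parse_list_line() are hypothetical names, and the optional module name column is simply ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one line of the kprobes list. */
struct probe_entry {
	unsigned long long addr;	/* first column: kernel address */
	char type;			/* second column: 'k', 'j' or 'r' */
	char location[128];		/* third column: symbol+offset */
	bool gone, disabled, optimized;	/* status markers, if present */
};

/* Returns 0 on success, -1 if the three leading columns are missing. */
int parse_list_line(const char *line, struct probe_entry *e)
{
	memset(e, 0, sizeof(*e));

	if (sscanf(line, "%llx %c %127s",
		   &e->addr, &e->type, e->location) != 3)
		return -1;

	/* Status markers may follow in any order after the third column. */
	e->gone = strstr(line, "[GONE]") != NULL;
	e->disabled = strstr(line, "[DISABLED]") != NULL;
	e->optimized = strstr(line, "[OPTIMIZED]") != NULL;
	return 0;
}
```

Reading the file line by line with fgets() and passing each line to parse_list_line() would then give, for example, a quick count of how many registered probes are currently marked [DISABLED].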
Appendix B: The kprobes sysctl interface /proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF. When CONFIG_OPTPROBES=y, this sysctl interface appears and it provides a knob to globally and forcibly turn jump optimization (see section 1.4) ON or OFF. By default, jump optimization is allowed (ON). If you echo "0" to this file or set "debug.kprobes_optimization" to 0 via sysctl, all optimized probes will be unoptimized, and any new probes registered after that will not be optimized. Note that this knob *changes* the optimized state. This means that optimized probes (marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be removed). If the knob is turned on, they will be optimized again. krefs allow you to add reference counters to your objects. If you have objects that are used in multiple places and passed around, and you don't have refcounts, your code is almost certainly broken. If you want refcounts, krefs are the way to go. To use a kref, add one to your data structures like: struct my_data { . . struct kref refcount; . . }; The kref can occur anywhere within the data structure. You must initialize the kref after you allocate it. To do this, call kref_init as so: struct my_data *data; data = kmalloc(sizeof(*data), GFP_KERNEL); if (!data) return -ENOMEM; kref_init(&data->refcount); This sets the refcount in the kref to 1. Once you have an initialized kref, you must follow the following rules: 1) If you make a non-temporary copy of a pointer, especially if it can be passed to another thread of execution, you must increment the refcount with kref_get() before passing it off: kref_get(&data->refcount); If you already have a valid pointer to a kref-ed structure (the refcount cannot go to zero) you may do this without a lock. 2) When you are done with a pointer, you must call kref_put(): kref_put(&data->refcount, data_release); If this is the last reference to the pointer, the release routine will be called. 
If the code never tries to get a valid pointer to a kref-ed structure without already holding a valid pointer, it is safe to do this without a lock. 3) If the code attempts to gain a reference to a kref-ed structure without already holding a valid pointer, it must serialize access where a kref_put() cannot occur during the kref_get(), and the structure must remain valid during the kref_get(). For example, if you allocate some data and then pass it to another thread to process: void data_release(struct kref *ref) { struct my_data *data = container_of(ref, struct my_data, refcount); kfree(data); } void more_data_handling(void *cb_data) { struct my_data *data = cb_data; . . do stuff with data here . kref_put(&data->refcount, data_release); } int my_data_handler(void) { int rv = 0; struct my_data *data; struct task_struct *task; data = kmalloc(sizeof(*data), GFP_KERNEL); if (!data) return -ENOMEM; kref_init(&data->refcount); kref_get(&data->refcount); task = kthread_run(more_data_handling, data, "more_data_handling"); if (task == ERR_PTR(-ENOMEM)) { rv = -ENOMEM; goto out; } . . do stuff with data here . out: kref_put(&data->refcount, data_release); return rv; } This way, it doesn't matter what order the two threads handle the data, the kref_put() handles knowing when the data is not referenced any more and releasing it. The kref_get() does not require a lock, since we already have a valid pointer that we own a refcount for. The put needs no lock because nothing tries to get the data without already holding a pointer. Note that the "before" in rule 1 is very important. You should never do something like: task = kthread_run(more_data_handling, data, "more_data_handling"); if (task == ERR_PTR(-ENOMEM)) { rv = -ENOMEM; goto out; } else /* BAD BAD BAD - get is after the handoff */ kref_get(&data->refcount); Don't assume you know what you are doing and use the above construct. First of all, you may not know what you are doing. 
Second, you may know what you are doing (there are some situations where
locking is involved where the above may be legal) but someone else who
doesn't know what they are doing may change the code or copy the code.
It's bad style. Don't do it.

There are some situations where you can optimize the gets and puts.
For instance, if you are done with an object and enqueuing it for
something else or passing it off to something else, there is no reason
to do a get then a put:

    /* Silly extra get and put */
    kref_get(&obj->ref);
    enqueue(obj);
    kref_put(&obj->ref, obj_cleanup);

Just do the enqueue. A comment about this is always welcome:

    enqueue(obj);
    /* We are done with obj, so we pass our refcount off
       to the queue. DON'T TOUCH obj AFTER HERE! */

The last rule (rule 3) is the nastiest one to handle. Say, for
instance, you have a list of items that are each kref-ed, and you wish
to get the first one. You can't just pull the first item off the list
and kref_get() it. That violates rule 3 because you are not already
holding a valid pointer. You must add a mutex (or some other lock).
For instance:

    static DEFINE_MUTEX(mutex);
    static LIST_HEAD(q);
    struct my_data {
            struct kref      refcount;
            struct list_head link;
    };

    static struct my_data *get_entry(void)
    {
            struct my_data *entry = NULL;

            mutex_lock(&mutex);
            if (!list_empty(&q)) {
                    entry = container_of(q.next, struct my_data, link);
                    kref_get(&entry->refcount);
            }
            mutex_unlock(&mutex);
            return entry;
    }

    static void release_entry(struct kref *ref)
    {
            struct my_data *entry = container_of(ref, struct my_data, refcount);

            list_del(&entry->link);
            kfree(entry);
    }

    static void put_entry(struct my_data *entry)
    {
            mutex_lock(&mutex);
            kref_put(&entry->refcount, release_entry);
            mutex_unlock(&mutex);
    }

The kref_put() return value is useful if you do not want to hold the
lock during the whole release operation. Say you didn't want to call
kfree() with the lock held in the example above (since it is kind of
pointless to do so).
You could use kref_put() as follows:

    static void release_entry(struct kref *ref)
    {
            /* All work is done after the return from kref_put(). */
    }

    static void put_entry(struct my_data *entry)
    {
            mutex_lock(&mutex);
            if (kref_put(&entry->refcount, release_entry)) {
                    list_del(&entry->link);
                    mutex_unlock(&mutex);
                    kfree(entry);
            } else
                    mutex_unlock(&mutex);
    }

This is really more useful if you have to call other routines as part
of the free operations that could take a long time or might claim the
same lock. Note that doing everything in the release routine is still
preferred as it is a little neater.


Corey Minyard

A lot of this was lifted from Greg Kroah-Hartman's 2004 OLS paper and
presentation on krefs, which can be found at:
  http://www.kroah.com/linux/talks/ols_2004_kref_paper/Reprint-Kroah-Hartman-OLS2004.pdf
and:
  http://www.kroah.com/linux/talks/ols_2004_kref_talk/


            LDM - Logical Disk Manager (Dynamic Disks)
            ------------------------------------------

Originally Written by FlatCap - Richard Russon.
Last Updated by Anton Altaparmakov on 30 March 2007 for Windows Vista.

Overview
--------

Windows 2000, XP, and Vista use a new partitioning scheme. It is a complete
replacement for the MSDOS style partitions. It stores its information in a
1MiB journalled database at the end of the physical disk. The size of
partitions is limited only by disk space. The maximum number of partitions is
nearly 2000.

Any partitions created under the LDM are called "Dynamic Disks". There are no
longer any primary or extended partitions. Normal MSDOS style partitions are
now known as Basic Disks.

If you wish to use Spanned, Striped, Mirrored or RAID 5 Volumes, you must use
Dynamic Disks. The journalling allows Windows to make changes to these
partitions and filesystems without the need to reboot.

Once the LDM driver has divided up the disk, you can use the MD driver to
assemble any multi-partition volumes, e.g. Stripes, RAID5.
To prevent legacy applications from repartitioning the disk, the LDM creates a
dummy MSDOS partition containing one disk-sized partition. This is what is
supported with the Linux LDM driver.

A newer approach that has been implemented with Vista is to put LDM on top of a
GPT label disk. This is not supported by the Linux LDM driver yet.


Example
-------

Below we have a 50MiB disk, divided into seven partitions.
N.B. The missing 1MiB at the end of the disk is where the LDM database is
     stored.

  Device | Offset Bytes  Sectors  MiB | Size   Bytes  Sectors  MiB
  -------+----------------------------+---------------------------
  hda    |            0        0    0 |     52428800   102400   50
  hda1   |     51380224   100352   49 |      1048576     2048    1
  hda2   |        16384       32    0 |      6979584    13632    6
  hda3   |      6995968    13664    6 |     10485760    20480   10
  hda4   |     17481728    34144   16 |      4194304     8192    4
  hda5   |     21676032    42336   20 |      5242880    10240    5
  hda6   |     26918912    52576   25 |     10485760    20480   10
  hda7   |     37404672    73056   35 |     13959168    27264   13

The LDM Database may not store the partitions in the order that they appear on
disk, but the driver will sort them.

When Linux boots, you will see something like:

  hda: 102400 sectors w/32KiB Cache, CHS=50/64/32
  hda: [LDM] hda1 hda2 hda3 hda4 hda5 hda6 hda7


Compiling LDM Support
---------------------

To enable LDM, choose the following two options:

  "Advanced partition selection"
  CONFIG_PARTITION_ADVANCED

  "Windows Logical Disk Manager (Dynamic Disk) support"
  CONFIG_LDM_PARTITION

If you believe the driver isn't working as it should, you can enable the extra
debugging code. This will produce a LOT of output. The option is:

  "Windows LDM extra logging"
  CONFIG_LDM_DEBUG

N.B. The partition code cannot be compiled as a module.

As with all the partition code, if the driver doesn't see signs of its type of
partition, it will pass control to another driver, so there is no harm in
enabling it.

If you have Dynamic Disks but don't enable the driver, then all you will see
is a dummy MSDOS partition filling the whole disk.
You won't be able to mount any of the volumes on the disk.


Booting
-------

If you enable LDM support, then lilo is capable of booting from any of the
discovered partitions. However, grub does not understand the LDM partitioning
and cannot boot from a Dynamic Disk.


More Documentation
------------------

There is an Overview of the LDM together with complete Technical Documentation.
It is available for download.

  http://www.linux-ntfs.org/

If you have any LDM questions that aren't answered in the documentation, email
me.

Cheers,
    FlatCap - Richard Russon
    ldm@flatcap.org


         Semantics and Behavior of Local Atomic Operations

                        Mathieu Desnoyers


This document explains the purpose of the local atomic operations, how
to implement them for any given architecture, and how they can be used
properly. It also stresses the precautions that must be taken when
reading those local variables across CPUs when the order of memory
writes matters.


* Purpose of local atomic operations

Local atomic operations are meant to provide fast and highly reentrant per CPU
counters. They minimize the performance cost of standard atomic operations by
removing the LOCK prefix and memory barriers normally required to synchronize
across CPUs.

Having fast per CPU atomic counters is interesting in many cases : it does not
require disabling interrupts to protect from interrupt handlers and it permits
coherent counters in NMI handlers. It is especially useful for tracing purposes
and for various performance monitoring counters.

Local atomic operations only guarantee variable modification atomicity wrt the
CPU which owns the data. Therefore, care must be taken to make sure that only
one CPU writes to the local_t data. This is done by using per cpu data and
making sure that we modify it from within a preemption safe context. It is
however permitted to read local_t data from any CPU : it will then appear to be
written out of order wrt other memory writes by the owner CPU.


* Implementation for a given architecture

It can be done by slightly modifying the standard atomic operations : only
their UP variant must be kept. It typically means removing LOCK prefix (on
i386 and x86_64) and any SMP synchronization barrier. If the architecture does
not have a different behavior between SMP and UP, including asm-generic/local.h
in your architecture's local.h is sufficient.

The local_t type is defined as an opaque signed long by embedding an
atomic_long_t inside a structure. This is done so that a cast from this type
to a long fails. The definition looks like :

    typedef struct { atomic_long_t a; } local_t;


* Rules to follow when using local atomic operations

- Variables touched by local ops must be per cpu variables.
- _Only_ the CPU owner of these variables must write to them.
- This CPU can use local ops from any context (process, irq, softirq, nmi, ...)
  to update its local_t variables.
- Preemption (or interrupts) must be disabled when using local ops in
  process context to make sure the process won't be migrated to a
  different CPU between getting the per-cpu variable and doing the
  actual local op.
- When using local ops in interrupt context, no special care must be
  taken on a mainline kernel, since they will run on the local CPU with
  preemption already disabled. I suggest, however, explicitly
  disabling preemption anyway to make sure it will still work correctly on
  -rt kernels.
- Reading the local cpu variable will provide the current copy of the
  variable.
- Reads of these variables can be done from any CPU, because updates to
  "long", aligned, variables are always atomic. Since no memory
  synchronization is done by the writer CPU, an outdated copy of the
  variable can be read when reading some _other_ cpu's variables.


* How to use local atomic operations

    #include <linux/percpu.h>
    #include <asm/local.h>

    static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);


* Counting

Counting is done on all the bits of a signed long.

In preemptible context, use get_cpu_var() and put_cpu_var() around local atomic
operations : it makes sure that preemption is disabled around write access to
the per cpu variable. For instance :

    local_inc(&get_cpu_var(counters));
    put_cpu_var(counters);

If you are already in a preemption-safe context, you can directly use
__get_cpu_var() instead.

    local_inc(&__get_cpu_var(counters));


* Reading the counters

Those local counters can be read from foreign CPUs to sum the count. Note that
the data seen by local_read across CPUs must be considered to be out of order
relative to other memory writes happening on the CPU that owns the data.

    long sum = 0;
    for_each_online_cpu(cpu)
            sum += local_read(&per_cpu(counters, cpu));

If you want to use a remote local_read to synchronize access to a resource
between CPUs, explicit smp_wmb() and smp_rmb() memory barriers must be used
respectively on the writer and the reader CPUs. It would be the case if you use
the local_t variable as a counter of bytes written in a buffer : there should
be a smp_wmb() between the buffer write and the counter increment and also a
smp_rmb() between the counter read and the buffer read.


Here is a sample module which implements a basic per cpu counter using local.h.

--- BEGIN ---
/* test-local.c
 *
 * Sample module for local.h usage.
 */


#include <asm/local.h>
#include <linux/module.h>
#include <linux/timer.h>

static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);

static struct timer_list test_timer;

/* IPI called on each CPU. */
static void test_each(void *info)
{
        /* Increment the counter from a non preemptible context */
        printk("Increment on cpu %d\n", smp_processor_id());
        local_inc(&__get_cpu_var(counters));

        /* This is what incrementing the variable would look like within a
         * preemptible context (it disables preemption) :
         *
         * local_inc(&get_cpu_var(counters));
         * put_cpu_var(counters);
         */
}

static void do_test_timer(unsigned long data)
{
        int cpu;

        /* Increment the counters */
        on_each_cpu(test_each, NULL, 1);
        /* Read all the counters */
        printk("Counters read from CPU %d\n", smp_processor_id());
        for_each_online_cpu(cpu) {
                printk("Read : CPU %d, count %ld\n", cpu,
                        local_read(&per_cpu(counters, cpu)));
        }
        del_timer(&test_timer);
        test_timer.expires = jiffies + 1000;
        add_timer(&test_timer);
}

static int __init test_init(void)
{
        /* initialize the timer that will increment the counter */
        init_timer(&test_timer);
        test_timer.function = do_test_timer;
        test_timer.expires = jiffies + 1;
        add_timer(&test_timer);

        return 0;
}

static void __exit test_exit(void)
{
        del_timer_sync(&test_timer);
}

module_init(test_init);
module_exit(test_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers");
MODULE_DESCRIPTION("Local Atomic Ops");
--- END ---


Runtime locking correctness validator
=====================================

started by Ingo Molnar
additions by Arjan van de Ven

Lock-class
----------

The basic object the validator operates upon is a 'class' of locks.

A class of locks is a group of locks that are logically the same with
respect to locking rules, even if the locks may have multiple (possibly
tens of thousands of) instantiations. For example a lock in the inode
struct is one class, while each inode has its own instantiation of that
lock class.

The validator tracks the 'state' of lock-classes, and it tracks
dependencies between different lock-classes. The validator maintains a
rolling proof that the state and the dependencies are correct.

Unlike a lock instantiation, the lock-class itself never goes away: when
a lock-class is used for the first time after bootup it gets registered,
and all subsequent uses of that lock-class will be attached to this
lock-class.

State
-----

The validator tracks lock-class usage history into 4n + 1 separate state bits:

- 'ever held in STATE context'
- 'ever held as readlock in STATE context'
- 'ever held with STATE enabled'
- 'ever held as readlock with STATE enabled'

Where STATE can be either one of (kernel/lockdep_states.h)
 - hardirq
 - softirq
 - reclaim_fs

- 'ever used'                                       [ == !unused        ]

When locking rules are violated, these state bits are presented in the
locking error messages, inside curlies. A contrived example:

   modprobe/2287 is trying to acquire lock:
    (&sio_locks[i].lock){-.-...}, at: [] mutex_lock+0x21/0x24

   but task is already holding lock:
    (&sio_locks[i].lock){-.-...}, at: [] mutex_lock+0x21/0x24


The bit position indicates STATE, STATE-read, for each of the states listed
above, and the character displayed in each indicates:

   '.'  acquired while irqs disabled and not in irq context
   '-'  acquired in irq context
   '+'  acquired with irqs enabled
   '?'  acquired in irq context with irqs enabled.

Unused mutexes cannot be part of the cause of an error.


Single-lock state rules:
------------------------

A softirq-unsafe lock-class is automatically hardirq-unsafe as well. The
following states are exclusive, and only one of them is allowed to be
set for any lock-class:

 <hardirq-safe> and <hardirq-unsafe>
 <softirq-safe> and <softirq-unsafe>

The validator detects and reports lock usage that violates these
single-lock state rules.

Multi-lock dependency rules:
----------------------------

The same lock-class must not be acquired twice, because this could lead
to lock recursion deadlocks.

Furthermore, two locks may not be taken in different order:

 <L1> -> <L2>
 <L2> -> <L1>

because this could lead to lock inversion deadlocks. (The validator
finds such dependencies in arbitrary complexity, i.e.
there can be any other locking sequence between the acquire-lock
operations, the validator will still track all dependencies between locks.)

Furthermore, the following usage based lock dependencies are not allowed
between any two lock-classes:

   <hardirq-safe>   ->  <hardirq-unsafe>
   <softirq-safe>   ->  <softirq-unsafe>

The first rule comes from the fact that a hardirq-safe lock could be
taken by a hardirq context, interrupting a hardirq-unsafe lock - and
thus could result in a lock inversion deadlock. Likewise, a softirq-safe
lock could be taken by a softirq context, interrupting a softirq-unsafe
lock.

The above rules are enforced for any locking sequence that occurs in the
kernel: when acquiring a new lock, the validator checks whether there is
any rule violation between the new lock and any of the held locks.

When a lock-class changes its state, the following aspects of the above
dependency rules are enforced:

- if a new hardirq-safe lock is discovered, we check whether it
  took any hardirq-unsafe lock in the past.

- if a new softirq-safe lock is discovered, we check whether it took
  any softirq-unsafe lock in the past.

- if a new hardirq-unsafe lock is discovered, we check whether any
  hardirq-safe lock took it in the past.

- if a new softirq-unsafe lock is discovered, we check whether any
  softirq-safe lock took it in the past.

(Again, we do these checks too on the basis that an interrupt context
could interrupt _any_ of the irq-unsafe or hardirq-unsafe locks, which
could lead to a lock inversion deadlock - even if that lock scenario did
not trigger in practice yet.)

Exception: Nested data dependencies leading to nested locking
-------------------------------------------------------------

There are a few cases where the Linux kernel acquires more than one
instance of the same lock-class. Such cases typically happen when there
is some sort of hierarchy within objects of the same type.
In these cases there is an inherent "natural" ordering between the two
objects (defined by the properties of the hierarchy), and the kernel grabs
the locks in this fixed order on each of the objects.

An example of such an object hierarchy that results in "nested locking"
is that of a "whole disk" block-dev object and a "partition" block-dev
object; the partition is "part of" the whole device and as long as one
always takes the whole disk lock as a higher lock than the partition
lock, the lock ordering is fully correct. The validator does not
automatically detect this natural ordering, as the locking rule behind
the ordering is not static.

In order to teach the validator about this correct usage model, new
versions of the various locking primitives were added that allow you to
specify a "nesting level". An example call, for the block device mutex,
looks like this:

enum bdev_bd_mutex_lock_class
{
       BD_MUTEX_NORMAL,
       BD_MUTEX_WHOLE,
       BD_MUTEX_PARTITION
};

 mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION);

In this case the locking is done on a bdev object that is known to be a
partition.

The validator treats a lock that is taken in such a nested fashion as a
separate (sub)class for the purposes of validation.

Note: When changing code to use the _nested() primitives, be careful and
check really thoroughly that the hierarchy is correctly mapped; otherwise
you can get false positives or false negatives.

Proof of 100% correctness:
--------------------------

The validator achieves perfect, mathematical 'closure' (proof of locking
correctness) in the sense that for every simple, standalone single-task
locking sequence that occurred at least once during the lifetime of the
kernel, the validator proves it with a 100% certainty that no
combination and timing of these locking sequences can cause any class of
lock related deadlock. [*]

I.e.
complex multi-CPU and multi-task locking scenarios do not have to occur in practice to prove a deadlock: only the simple 'component' locking chains have to occur at least once (anytime, in any task/context) for the validator to be able to prove correctness. (For example, complex deadlocks that would normally need more than 3 CPUs and a very unlikely constellation of tasks, irq-contexts and timings to occur, can be detected on a plain, lightly loaded single-CPU system as well!) This radically decreases the complexity of locking related QA of the kernel: what has to be done during QA is to trigger as many "simple" single-task locking dependencies in the kernel as possible, at least once, to prove locking correctness - instead of having to trigger every possible combination of locking interaction between CPUs, combined with every possible hardirq and softirq nesting scenario (which is impossible to do in practice). [*] assuming that the validator itself is 100% correct, and no other part of the system corrupts the state of the validator in any way. We also assume that all NMI/SMM paths [which could interrupt even hardirq-disabled codepaths] are correct and do not interfere with the validator. We also assume that the 64-bit 'chain hash' value is unique for every lock-chain in the system. Also, lock recursion must not be higher than 20. Performance: ------------ The above rules require _massive_ amounts of runtime checking. If we did that for every lock taken and for every irqs-enable event, it would render the system practically unusably slow. The complexity of checking is O(N^2), so even with just a few hundred lock-classes we'd have to do tens of thousands of checks for every event. This problem is solved by checking any given 'locking scenario' (unique sequence of locks taken after each other) only once. A simple stack of held locks is maintained, and a lightweight 64-bit hash value is calculated, which hash is unique for every lock chain. 
The hash value, when the chain is validated for the first time, is then
put into a hash table, which hash-table can be checked in a lockfree
manner. If the locking chain occurs again later on, the hash table tells
us that we don't have to validate the chain again.

Troubleshooting:
----------------

The validator tracks a maximum of MAX_LOCKDEP_KEYS number of lock classes.
Exceeding this number will trigger the following lockdep warning:

        (DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS))

By default, MAX_LOCKDEP_KEYS is currently set to 8191, and typical
desktop systems have fewer than 1,000 lock classes, so this warning
normally results from lock-class leakage or failure to properly
initialize locks. These two problems are illustrated below:

1.      Repeated module loading and unloading while running the validator
        will result in lock-class leakage. The issue here is that each
        load of the module will create a new set of lock classes for
        that module's locks, but module unloading does not remove old
        classes (see below discussion of reuse of lock classes for why).
        Therefore, if that module is loaded and unloaded repeatedly,
        the number of lock classes will eventually reach the maximum.

2.      Using structures such as arrays that have large numbers of
        locks that are not explicitly initialized. For example,
        a hash table with 8192 buckets where each bucket has its own
        spinlock_t will consume 8192 lock classes -unless- each spinlock
        is explicitly initialized at runtime, for example, using the
        run-time spin_lock_init() as opposed to compile-time initializers
        such as __SPIN_LOCK_UNLOCKED(). Failure to properly initialize
        the per-bucket spinlocks would guarantee lock-class overflow.
        In contrast, a loop that called spin_lock_init() on each lock
        would place all 8192 locks into a single lock class.

        The moral of this story is that you should always explicitly
        initialize your locks.

One might argue that the validator should be modified to allow
lock classes to be reused.
However, if you are tempted to make this
argument, first review the code and think through the changes that would
be required, keeping in mind that the lock classes to be removed are
likely to be linked into the lock-dependency graph. This turns out to
be harder to do than to say.

Of course, if you do run out of lock classes, the next thing to do is
to find the offending lock classes. First, the following command gives
you the number of lock classes currently in use along with the maximum:

        grep "lock-classes" /proc/lockdep_stats

This command produces the following output on a modest system:

         lock-classes:                          748 [max: 8191]

If the number allocated (748 above) increases continually over time,
then there is likely a leak. The following command can be used to
identify the leaking lock classes:

        grep "BD" /proc/lockdep

Run the command and save the output, then compare against the output from
a later run of this command to identify the leakers. This same output
can also help you find situations where runtime lock initialization has
been omitted.


LOCK STATISTICS

- WHAT

As the name suggests, it provides statistics on locks.

- WHY

Because things like lock contention can severely impact performance.

- HOW

Lockdep already has hooks in the lock functions and maps lock instances to
lock classes. We build on that (see Documentation/lockdep-design.txt).
The graph below shows the relation between the lock functions and the various
hooks therein.

        __acquire
            |
           lock _____
            |        \
            |    __contended
            |         |
            |       <wait>
            | _______/
            |/
            |
       __acquired
            |
            .
          <hold>
            .
            |
       __release
            |
         unlock

lock, unlock    - the regular lock functions
__*             - the hooks
<>              - states

With these hooks we provide the following statistics:

 con-bounces       - number of lock contention that involved x-cpu data
 contentions       - number of lock acquisitions that had to wait
 wait time min     - shortest (non-0) time we ever had to wait for a lock
           max     - longest time we ever had to wait for a lock
           total   - total time we spend waiting on this lock
 acq-bounces       - number of lock acquisitions that involved x-cpu data
 acquisitions      - number of times we took the lock
 hold time min     - shortest (non-0) time we ever held the lock
           max     - longest time we ever held the lock
           total   - total time this lock was held

From these numbers, various other statistics can be derived, such as:

 hold time average = hold time total / acquisitions

These numbers are gathered per lock class, per read/write state (when
applicable).

It also tracks 4 contention points per class. A contention point is a call site
that had to wait on lock acquisition.

- CONFIGURATION

Lock statistics are enabled via CONFIG_LOCK_STATS.
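
The derived statistics mentioned under HOW can be sketched in plain userspace
C. This is only an illustration, not part of lockstat: the struct
lock_class_stats and the three helper names are hypothetical, and the wait
time average and contention ratio are assumed by analogy with the hold time
average formula given above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the raw per-class counters described above. */
struct lock_class_stats {
	unsigned long contentions;	/* acquisitions that had to wait */
	unsigned long acquisitions;	/* times the lock was taken */
	double wait_total;		/* total time spent waiting */
	double hold_total;		/* total time the lock was held */
};

/* hold time average = hold time total / acquisitions */
double hold_time_average(const struct lock_class_stats *s)
{
	assert(s != NULL);
	return s->acquisitions ? s->hold_total / s->acquisitions : 0.0;
}

/* Assumed analogue: wait time total divided by the contended acquisitions. */
double wait_time_average(const struct lock_class_stats *s)
{
	assert(s != NULL);
	return s->contentions ? s->wait_total / s->contentions : 0.0;
}

/* Assumed analogue: fraction of acquisitions that had to wait. */
double contention_ratio(const struct lock_class_stats *s)
{
	assert(s != NULL);
	return s->acquisitions ?
		(double)s->contentions / s->acquisitions : 0.0;
}
```

The results are in whatever time unit the raw totals use. A class with no
acquisitions (or no contentions) yields 0.0 rather than dividing by zero.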
- USAGE Enable collection of statistics: # echo 1 >/proc/sys/kernel/lock_stat Disable collection of statistics: # echo 0 >/proc/sys/kernel/lock_stat Look at the current lock statistics: ( line numbers not part of actual output, done for clarity in the explanation below ) # less /proc/lock_stat 01 lock_stat version 0.3 02 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 03 class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total 04 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 05 06 &mm->mmap_sem-W: 233 538 18446744073708 22924.27 607243.51 1342 45806 1.71 8595.89 1180582.34 07 &mm->mmap_sem-R: 205 587 18446744073708 28403.36 731975.00 1940 412426 0.58 187825.45 6307502.88 08 --------------- 09 &mm->mmap_sem 487 [] do_page_fault+0x466/0x928 10 &mm->mmap_sem 179 [] sys_mprotect+0xcd/0x21d 11 &mm->mmap_sem 279 [] sys_mmap+0x75/0xce 12 &mm->mmap_sem 76 [] sys_munmap+0x32/0x59 13 --------------- 14 &mm->mmap_sem 270 [] sys_mmap+0x75/0xce 15 &mm->mmap_sem 431 [] do_page_fault+0x466/0x928 16 &mm->mmap_sem 138 [] sys_munmap+0x32/0x59 17 &mm->mmap_sem 145 [] sys_mprotect+0xcd/0x21d 18 19 ............................................................................................................................................................................................... 
20 21 dcache_lock: 621 623 0.52 118.26 1053.02 6745 91930 0.29 316.29 118423.41 22 ----------- 23 dcache_lock 179 [] _atomic_dec_and_lock+0x34/0x54 24 dcache_lock 113 [] d_alloc+0x19a/0x1eb 25 dcache_lock 99 [] d_rehash+0x1b/0x44 26 dcache_lock 104 [] d_instantiate+0x36/0x8a 27 ----------- 28 dcache_lock 192 [] _atomic_dec_and_lock+0x34/0x54 29 dcache_lock 98 [] d_rehash+0x1b/0x44 30 dcache_lock 72 [] d_alloc+0x19a/0x1eb 31 dcache_lock 112 [] d_instantiate+0x36/0x8a This excerpt shows the first two lock class statistics. Line 01 shows the output version - each time the format changes this will be updated. Line 02-04 show the header with column descriptions. Lines 05-18 and 20-31 show the actual statistics. These statistics come in two parts; the actual stats separated by a short separator (line 08, 13) from the contention points. The first lock (05-18) is a read/write lock, and shows two lines above the short separator. The contention points don't match the column descriptors, they have two: contentions and [] symbol. The second set of contention points are the points we're contending with. The integer part of the time values is in us. Dealing with nested locks, subclasses may appear: 32............................................................................................................................................................................................... 
33 34 &rq->lock: 13128 13128 0.43 190.53 103881.26 97454 3453404 0.00 401.11 13224683.11 35 --------- 36 &rq->lock 645 [] task_rq_lock+0x43/0x75 37 &rq->lock 297 [] try_to_wake_up+0x127/0x25a 38 &rq->lock 360 [] select_task_rq_fair+0x1f0/0x74a 39 &rq->lock 428 [] scheduler_tick+0x46/0x1fb 40 --------- 41 &rq->lock 77 [] task_rq_lock+0x43/0x75 42 &rq->lock 174 [] try_to_wake_up+0x127/0x25a 43 &rq->lock 4715 [] double_rq_lock+0x42/0x54 44 &rq->lock 893 [] schedule+0x157/0x7b8 45 46............................................................................................................................................................................................... 47 48 &rq->lock/1: 11526 11488 0.33 388.73 136294.31 21461 38404 0.00 37.93 109388.53 49 ----------- 50 &rq->lock/1 11526 [] double_rq_lock+0x4f/0x54 51 ----------- 52 &rq->lock/1 5645 [] double_rq_lock+0x42/0x54 53 &rq->lock/1 1224 [] schedule+0x157/0x7b8 54 &rq->lock/1 4336 [] double_rq_lock+0x4f/0x54 55 &rq->lock/1 181 [] try_to_wake_up+0x127/0x25a Line 48 shows statistics for the second subclass (/1) of &rq->lock class (subclass starts from 0), since in this case, as line 50 suggests, double_rq_lock actually acquires a nested lock of two spinlocks. 
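
For post-processing, a statistics line such as line 21 or 34 above can be
split into its ten columns with sscanf(). The following is a minimal userspace
sketch, assuming the version 0.3 column order shown in header line 03 and
input without the explanatory line numbers; struct lock_stat_line and
parse_lock_stat_line() are hypothetical names, not a kernel interface.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record holding one statistics line of /proc/lock_stat. */
struct lock_stat_line {
	char name[64];			/* class name, without the ':' */
	unsigned long con_bounces;
	unsigned long contentions;
	double waittime_min, waittime_max, waittime_total;
	unsigned long acq_bounces;
	unsigned long acquisitions;
	double holdtime_min, holdtime_max, holdtime_total;
};

/*
 * Parse "<class name>: <10 columns>". Returns 0 on success and -1 for
 * anything else (header, separators, contention points, blank lines),
 * because only statistics lines have a name followed by ':' and ten
 * numeric fields.
 */
int parse_lock_stat_line(const char *line, struct lock_stat_line *out)
{
	int n;

	assert(line != NULL && out != NULL);
	n = sscanf(line, " %63[^:]: %lu %lu %lf %lf %lf %lu %lu %lf %lf %lf",
		   out->name, &out->con_bounces, &out->contentions,
		   &out->waittime_min, &out->waittime_max,
		   &out->waittime_total, &out->acq_bounces,
		   &out->acquisitions, &out->holdtime_min,
		   &out->holdtime_max, &out->holdtime_total);
	return n == 11 ? 0 : -1;
}
```

Subclass names such as "&rq->lock/1" and read/write suffixes such as "-W" are
kept as part of the name, so each parsed record corresponds to one line of the
excerpt.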

View the top contending locks:

# grep : /proc/lock_stat | head
  &inode->i_data.tree_lock-W: 15 21657 0.18 1093295.30 11547131054.85 58 10415 0.16 87.51 6387.60
  &inode->i_data.tree_lock-R: 0 0 0.00 0.00 0.00 23302 231198 0.25 8.45 98023.38
  dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24
  &inode->i_mutex: 161 286 18446744073709 62882.54 1244614.55 3653 20598 18446744073709 62318.60 1693822.74
  &zone->lru_lock: 94 94 0.53 7.33 92.10 4366 32690 0.29 59.81 16350.06
  &inode->i_data.i_mmap_mutex: 79 79 0.40 3.77 53.03 11779 87755 0.28 116.93 29898.44
  &q->__queue_lock: 48 50 0.52 31.62 86.31 774 13131 0.17 113.08 12277.52
  &rq->rq_lock_key: 43 47 0.74 68.50 170.63 3706 33929 0.22 107.99 17460.62
  &rq->rq_lock_key#2: 39 46 0.75 6.68 49.03 2979 32292 0.17 125.17 17137.63
  tasklist_lock-W: 15 15 1.45 10.87 32.70 1201 7390 0.58 62.55 13648.47

Clear the statistics:

# echo 0 > /proc/lock_stat


This file is a registry of magic numbers which are in use. When you
add a magic number to a structure, you should also add it to this
file, since it is best if the magic numbers used by various structures
are unique.

It is a *very* good idea to protect kernel data structures with magic
numbers. This allows you to check at run time whether (a) a structure
has been clobbered, or (b) you've passed the wrong structure to a
routine. This last is especially useful --- particularly when you are
passing pointers to structures via a void * pointer. The tty code,
for example, does this frequently to pass driver-specific and line
discipline-specific structures back and forth.

The way to use magic numbers is to declare them at the beginning of
the structure, like so:

struct tty_ldisc {
        int     magic;
        ...
};

Please follow this discipline when you are adding future enhancements
to the kernel! It has saved me countless hours of debugging,
especially in the screwy cases where an array has been overrun and
structures following the array have been overwritten.
Using this discipline, these cases get detected quickly and safely.

					Theodore Ts'o
					31 Mar 94

The magic table is current to Linux 2.1.55.

					Michael Chastain
					22 Sep 1997

Now it should be up to date with Linux 2.1.112. Because
we are in feature freeze time it is very unlikely that
something will change before 2.2.x. The entries are
sorted by number field.

					Krzysztof G. Baranowski
					29 Jul 1998

Updated the magic table to Linux 2.5.45. Right over the feature freeze,
but it is possible that some new magic numbers will sneak into the
kernel before 2.6.x yet.

					Petr Baudis
					03 Nov 2002

Updated the magic table to Linux 2.5.74.

					Fabian Frederick
					09 Jul 2003

Magic Name            Number           Structure                File
===========================================================================
PG_MAGIC              'P'              pg_{read,write}_hdr      include/linux/pg.h
CMAGIC                0x0111           user                     include/linux/a.out.h
MKISS_DRIVER_MAGIC    0x04bf           mkiss_channel            drivers/net/mkiss.h
RISCOM8_MAGIC         0x0907           riscom_port              drivers/char/riscom8.h
SPECIALIX_MAGIC       0x0907           specialix_port           drivers/char/specialix_io8.h
HDLC_MAGIC            0x239e           n_hdlc                   drivers/char/n_hdlc.c
APM_BIOS_MAGIC        0x4101           apm_user                 arch/x86/kernel/apm_32.c
CYCLADES_MAGIC        0x4359           cyclades_port            include/linux/cyclades.h
DB_MAGIC              0x4442           fc_info                  drivers/net/iph5526_novram.c
DL_MAGIC              0x444d           fc_info                  drivers/net/iph5526_novram.c
FASYNC_MAGIC          0x4601           fasync_struct            include/linux/fs.h
FF_MAGIC              0x4646           fc_info                  drivers/net/iph5526_novram.c
ISICOM_MAGIC          0x4d54           isi_port                 include/linux/isicom.h
PTY_MAGIC             0x5001                                    drivers/char/pty.c
PPP_MAGIC             0x5002           ppp                      include/linux/if_pppvar.h
SERIAL_MAGIC          0x5301           async_struct             include/linux/serial.h
SSTATE_MAGIC          0x5302           serial_state             include/linux/serial.h
SLIP_MAGIC            0x5302           slip                     drivers/net/slip.h
STRIP_MAGIC           0x5303           strip                    drivers/net/strip.c
X25_ASY_MAGIC         0x5303           x25_asy                  drivers/net/x25_asy.h
SIXPACK_MAGIC         0x5304           sixpack                  drivers/net/hamradio/6pack.h
AX25_MAGIC            0x5316           ax_disp                  drivers/net/mkiss.h
ESP_MAGIC             0x53ee           esp_struct               drivers/char/esp.h
TTY_MAGIC             0x5401           tty_struct               include/linux/tty.h
MGSL_MAGIC            0x5401           mgsl_info                drivers/char/synclink.c
TTY_DRIVER_MAGIC      0x5402           tty_driver               include/linux/tty_driver.h
MGSLPC_MAGIC          0x5402           mgslpc_info              drivers/char/pcmcia/synclink_cs.c
TTY_LDISC_MAGIC       0x5403           tty_ldisc                include/linux/tty_ldisc.h
USB_SERIAL_MAGIC      0x6702           usb_serial               drivers/usb/serial/usb-serial.h
FULL_DUPLEX_MAGIC     0x6969                                    drivers/net/tulip/de2104x.c
USB_BLUETOOTH_MAGIC   0x6d02           usb_bluetooth            drivers/usb/class/bluetty.c
RFCOMM_TTY_MAGIC      0x6d02                                    net/bluetooth/rfcomm/tty.c
USB_SERIAL_PORT_MAGIC 0x7301           usb_serial_port          drivers/usb/serial/usb-serial.h
CG_MAGIC              0x00090255       ufs_cylinder_group       include/linux/ufs_fs.h
A2232_MAGIC           0x000a2232       gs_port                  drivers/char/ser_a2232.h
RPORT_MAGIC           0x00525001       r_port                   drivers/char/rocket_int.h
LSEMAGIC              0x05091998       lse                      drivers/fc4/fc.c
GDTIOCTL_MAGIC        0x06030f07       gdth_iowr_str            drivers/scsi/gdth_ioctl.h
RIEBL_MAGIC           0x09051990                                drivers/net/atarilance.c
RIO_MAGIC             0x12345678       gs_port                  drivers/char/rio/rio_linux.c
SX_MAGIC              0x12345678       gs_port                  drivers/char/sx.h
NBD_REQUEST_MAGIC     0x12560953       nbd_request              include/linux/nbd.h
RED_MAGIC2            0x170fc2a5       (any)                    mm/slab.c
BAYCOM_MAGIC          0x19730510       baycom_state             drivers/net/baycom_epp.c
ISDN_X25IFACE_MAGIC   0x1e75a2b9       isdn_x25iface_proto_data drivers/isdn/isdn_x25iface.h
ECP_MAGIC             0x21504345       cdkecpsig                include/linux/cdk.h
LSOMAGIC              0x27091997       lso                      drivers/fc4/fc.c
LSMAGIC               0x2a3b4d2a       ls                       drivers/fc4/fc.c
WANPIPE_MAGIC         0x414C4453       sdla_{dump,exec}         include/linux/wanpipe.h
CS_CARD_MAGIC         0x43525553       cs_card                  sound/oss/cs46xx.c
LABELCL_MAGIC         0x4857434c       labelcl_info_s           include/asm/ia64/sn/labelcl.h
ISDN_ASYNC_MAGIC      0x49344C01       modem_info               include/linux/isdn.h
CTC_ASYNC_MAGIC       0x49344C01       ctc_tty_info             drivers/s390/net/ctctty.c
ISDN_NET_MAGIC        0x49344C02       isdn_net_local_s         drivers/isdn/i4l/isdn_net_lib.h
SAVEKMSG_MAGIC2       0x4B4D5347       savekmsg                 arch/*/amiga/config.c
STLI_BOARDMAGIC       0x4bc6c825       stlibrd                  include/linux/istallion.h
CS_STATE_MAGIC        0x4c4f4749       cs_state                 sound/oss/cs46xx.c
SLAB_C_MAGIC          0x4f17a36d       kmem_cache               mm/slab.c
COW_MAGIC             0x4f4f4f4d       cow_header_v1            arch/um/drivers/ubd_user.c
I810_CARD_MAGIC       0x5072696E       i810_card                sound/oss/i810_audio.c
TRIDENT_CARD_MAGIC    0x5072696E       trident_card             sound/oss/trident.c
ROUTER_MAGIC          0x524d4157       wan_device               include/linux/wanrouter.h
SCC_MAGIC             0x52696368       gs_port                  drivers/char/scc.h
SAVEKMSG_MAGIC1       0x53415645       savekmsg                 arch/*/amiga/config.c
GDA_MAGIC             0x58464552       gda                      arch/mips/include/asm/sn/gda.h
RED_MAGIC1            0x5a2cf071       (any)                    mm/slab.c
STL_PORTMAGIC         0x5a7182c9       stlport                  include/linux/stallion.h
EEPROM_MAGIC_VALUE    0x5ab478d2       lanai_dev                drivers/atm/lanai.c
HDLCDRV_MAGIC         0x5ac6e778       hdlcdrv_state            include/linux/hdlcdrv.h
EPCA_MAGIC            0x5c6df104       channel                  include/linux/epca.h
PCXX_MAGIC            0x5c6df104       channel                  drivers/char/pcxx.h
KV_MAGIC              0x5f4b565f       kernel_vars_s            arch/mips/include/asm/sn/klkernvars.h
I810_STATE_MAGIC      0x63657373       i810_state               sound/oss/i810_audio.c
TRIDENT_STATE_MAGIC   0x63657373       trident_state            sound/oss/trident.c
M3_CARD_MAGIC         0x646e6f50       m3_card                  sound/oss/maestro3.c
FW_HEADER_MAGIC       0x65726F66       fw_header                drivers/atm/fore200e.h
SLOT_MAGIC            0x67267321       slot                     drivers/hotplug/cpqphp.h
SLOT_MAGIC            0x67267322       slot                     drivers/hotplug/acpiphp.h
LO_MAGIC              0x68797548       nbd_device               include/linux/nbd.h
OPROFILE_MAGIC        0x6f70726f       super_block              drivers/oprofile/oprofilefs.h
M3_STATE_MAGIC        0x734d724d       m3_state                 sound/oss/maestro3.c
STL_PANELMAGIC        0x7ef621a1       stlpanel                 include/linux/stallion.h
VMALLOC_MAGIC         0x87654320       snd_alloc_track          sound/core/memory.c
KMALLOC_MAGIC         0x87654321       snd_alloc_track          sound/core/memory.c
PWC_MAGIC             0x89DC10AB       pwc_device               drivers/usb/media/pwc.h
NBD_REPLY_MAGIC       0x96744668       nbd_reply                include/linux/nbd.h
STL_BOARDMAGIC        0xa2267f52       stlbrd                   include/linux/stallion.h
ENI155_MAGIC          0xa54b872d       midway_eprom             drivers/atm/eni.h
SCI_MAGIC             0xbabeface       gs_port                  drivers/char/sh-sci.h
CODA_MAGIC            0xC0DAC0DA       coda_file_info           fs/coda/coda_fs_i.h
DPMEM_MAGIC           0xc0ffee11       gdt_pci_sram             drivers/scsi/gdth.h
STLI_PORTMAGIC        0xe671c7a1       stliport                 include/linux/istallion.h
YAM_MAGIC             0xF10A7654       yam_port                 drivers/net/hamradio/yam.c
CCB_MAGIC             0xf2691ad2       ccb                      drivers/scsi/ncr53c8xx.c
QUEUE_MAGIC_FREE      0xf7e1c9a3       queue_entry              drivers/scsi/arm/queue.c
QUEUE_MAGIC_USED      0xf7e1cc33       queue_entry              drivers/scsi/arm/queue.c
HTB_CMAGIC            0xFEFAFEF1       htb_class                net/sched/sch_htb.c
NMI_MAGIC             0x48414d4d455201 nmi_s                    arch/mips/include/asm/sn/nmi.h

Note that special per-driver magic numbers are also defined in sound
memory management. See include/sound/sndmagic.h for a complete list of
them. Many OSS sound drivers have their magic numbers constructed from
the soundcard PCI ID - these are not listed here either.

The IrDA subsystem also uses a large number of its own magic numbers;
see include/net/irda/irda.h for a complete list of them.

HFS is another larger user of magic numbers - you can find them in
fs/hfs/hfs.h.

i386 Micro Channel Architecture Support
=======================================

MCA support is enabled using the CONFIG_MCA define.  A machine with a
MCA bus will have the kernel variable MCA_bus set, assuming the BIOS
feature bits are set properly (see arch/i386/boot/setup.S for
information on how this detection is done).

Adapter Detection
=================

The ideal MCA adapter detection is done through the use of the
Programmable Option Select registers.  Generic functions for doing
this have been added in include/linux/mca.h and
arch/x86/kernel/mca_32.c.  Everything needed to detect adapters and
read (and write) configuration information is there.  A number of
MCA-specific drivers already use this.  The typical probe code looks
like the following:

	#include <linux/mca.h>

	unsigned char pos2, pos3, pos4, pos5;
	struct net_device* dev;
	int slot;

	if( MCA_bus ) {
		slot = mca_find_adapter( ADAPTER_ID, 0 );
		if( slot == MCA_NOTFOUND ) {
			return -ENODEV;
		}
		/* optional - see below */
		mca_set_adapter_name( slot, "adapter name & description" );
		mca_set_adapter_procfn( slot, dev_getinfo, dev );

		/* read the POS registers.
		   Most devices only use 2 and 3 */
		pos2 = mca_read_stored_pos( slot, 2 );
		pos3 = mca_read_stored_pos( slot, 3 );
		pos4 = mca_read_stored_pos( slot, 4 );
		pos5 = mca_read_stored_pos( slot, 5 );
	} else {
		return -ENODEV;
	}

	/* extract configuration from pos[2345] and set everything up */

Loadable modules should modify this to test that the specified IRQ and
IO ports (plus whatever other stuff) match.  See 3c523.c for example
code (actually, smc-mca.c has a slightly more complex example that can
handle a list of adapter ids).

Keep in mind that devices should never directly access the POS
registers (via inb(), outb(), etc).  While it's generally safe, there
is a small potential for blowing up hardware when it's done at the
wrong time.  Furthermore, accessing a POS register disables a device
temporarily.  This is usually okay during startup, but do _you_ want to
rely on it?  During initial configuration, mca_init() reads all the POS
registers into memory.  mca_read_stored_pos() accesses that data.
mca_read_pos() and mca_write_pos() are also available for (safer)
direct POS access, but their use is _highly_ discouraged.
mca_write_pos() is particularly dangerous, as it is possible for
adapters to be put in inconsistent states (i.e. sharing IO address,
etc) and may result in crashes, toasted hardware, and blindness.

User level drivers (such as the AGX X server) can use /proc/mca/pos to
find adapters (see below).

Some MCA adapters can also be detected via the usual ISA-style device
probing (many SCSI adapters, for example).  This sort of thing is
highly discouraged.  Perfectly good information is available telling
you what's there, so there's no excuse for messing with random IO
ports.  However, we MCA people still appreciate any ISA-style driver
that will work with our hardware.  You take what you can get...
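
As a rough user-space illustration of the /proc/mca/pos approach
mentioned above, the sketch below parses one "Slot N:" line in the
layout shown in Appendix A and derives the 16-bit adapter id from the
first two POS bytes.  The byte order (POS byte 0 is the low byte) is
inferred from the sample, where POS bytes "42 60" go with Id 6042.  The
helper name parse_pos_line() and the assumption that every slot line
carries exactly eight hex bytes are made up for this example; they are
not part of any kernel interface.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel API): parse a line such as
 *   "Slot 7: 42 60 ff 08 ff ff ff ff  3Com 3c523 Etherlink/MC"
 * On success, store the slot number, the eight POS bytes and the
 * adapter id, and return 0.  Return -1 if the line does not match.
 */
static int parse_pos_line(const char *line, int *slot,
			  unsigned char pos[8], unsigned int *id)
{
	unsigned int b[8];
	int i;

	assert(line && slot && pos && id);

	/* One slot number plus eight hex bytes gives nine conversions. */
	if (sscanf(line, "Slot %d: %x %x %x %x %x %x %x %x",
		   slot, &b[0], &b[1], &b[2], &b[3],
		   &b[4], &b[5], &b[6], &b[7]) != 9)
		return -1;

	for (i = 0; i < 8; i++)
		pos[i] = (unsigned char)b[i];

	/* Assumed from the sample: POS byte 1 is the high byte of the id. */
	*id = ((unsigned int)pos[1] << 8) | pos[0];
	return 0;
}
```

A caller would read /proc/mca/pos line by line, pass each line to the
helper, and compare the resulting id against the adapter ids it knows
about.  Lines that do not start with "Slot" (the "Video" and "SCSI"
entries in the sample) are rejected by this sketch and would need
their own handling.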

Level-Triggered Interrupts
==========================

Because MCA uses level-triggered interrupts, a few problems arise with
what might best be described as the ISA mindset and its effects on
drivers.  These sorts of problems are expected to become less common as
more people use shared IRQs on PCI machines.

In general, an interrupt must be acknowledged not only at the ICU (which
is done automagically by the kernel), but at the device level.  In
particular, IRQ 0 must be reset after a timer interrupt (now done in
arch/x86/kernel/time.c) or the first timer interrupt hangs the system.
There were also problems with the 1.3.x floppy drivers, but that seems
to have been fixed.

IRQs are also shareable, and most MCA-specific devices should be coded
with shared IRQs in mind.

/proc/mca
=========

/proc/mca is a directory containing various files for adapters and
other stuff.

	/proc/mca/pos		Straight listing of POS registers
	/proc/mca/slot[1-8]	Information on adapter in specific slot
	/proc/mca/video		Same for integrated video
	/proc/mca/scsi		Same for integrated SCSI
	/proc/mca/machine	Machine information

See Appendix A for a sample.

Device drivers can easily add their own information function for
specific slots (including integrated ones) via the
mca_set_adapter_procfn() call.  Drivers that support this are ESDI, IBM
SCSI, and 3c523.  If a device is also a module, make sure that the proc
function is removed in the module cleanup.  This will require storing
the slot information in a private structure somewhere.  See the 3c523
driver for details.

Your typical proc function will look something like this:

	static int
	dev_getinfo( char* buf, int slot, void* d ) {
		struct net_device* dev = (struct net_device*) d;
		int len = 0;

		len += sprintf( buf+len, "Device: %s\n", dev->name );
		len += sprintf( buf+len, "IRQ: %d\n", dev->irq );
		len += sprintf( buf+len, "IO Port: %#lx-%#lx\n", ... );
		...

		return len;
	}

Some of the standard MCA information will already be printed, so don't
bother repeating it.
Don't try putting in more than 3K of information.

Enable this function with:
	mca_set_adapter_procfn( slot, dev_getinfo, dev );

Disable it with:
	mca_set_adapter_procfn( slot, NULL, NULL );

Even if you don't write a proc function, it is recommended that you set
the name of the adapter (e.g. "PS/2 ESDI Controller") via
mca_set_adapter_name( int slot, char* name ).

MCA Device Drivers
==================

Currently, there are a number of MCA-specific device drivers.

1) PS/2 SCSI
	drivers/scsi/ibmmca.c
	drivers/scsi/ibmmca.h
   The driver for the IBM SCSI subsystem.  Includes both integrated
   controllers and adapter cards.  May require command-line arg
   "ibmmcascsi=io_port" to force detection of an adapter.  If you have a
   machine with a front-panel display (e.g. model 95), you can use
   "ibmmcascsi=display" to enable a drive activity indicator.

2) 3c523
	drivers/net/3c523.c
	drivers/net/3c523.h
   3Com 3c523 Etherlink/MC ethernet driver.

3) SMC Ultra/MCA and IBM Adapter/A
	drivers/net/smc-mca.c
	drivers/net/smc-mca.h
   Driver for the MCA version of the SMC Ultra and various other
   OEM'ed and work-alike cards (Elite, Adapter/A, etc).

4) NE/2
	drivers/net/ne2.c
	drivers/net/ne2.h
   The NE/2 is the MCA version of the NE2000.  This may not work
   with clones that have a different adapter id than the original
   NE/2.

5) Future Domain MCS-600/700, OEM'd IBM Fast SCSI Adapter/A and
   Reply Sound Blaster/SCSI (SCSI part)
   Better support for these cards than the driver for ISA.
   Supports multiple cards with IRQ sharing.

Also added is the boot time option scsi-probe, which can reorder the
SCSI host adapters.  It tells the kernel the order in which SCSI
adapters should be detected.  Example:
	scsi-probe=ibmmca,fd_mcs,adaptec1542,buslogic

The serial drivers were modified to support the extended IO port range
of the typical MCA system (also #ifdef CONFIG_MCA).

The following devices work with existing drivers:

1) Token-ring
2) Future Domain SCSI (MCS-600, MCS-700, not MCS-350, OEM'ed IBM SCSI)
3) Adaptec 1640 SCSI (using the aha1542 driver)
4) Bustek/Buslogic SCSI (various)
5) Probably all Arcnet cards.
6) Some, possibly all, MCA IDE controllers.
7) 3Com 3c529 (MCA version of 3c509) (patched)
8) Intel EtherExpressMC (patched version)
   You need to have CONFIG_MCA defined to have EtherExpressMC support.
9) Reply Sound Blaster/SCSI (SB part) (patched version)

Bugs & Other Weirdness
======================

NMIs tend to occur with MCA machines because of various hardware
weirdness, bus timeouts, and many other non-critical things.  Some basic
code to handle them (inspired by the NetBSD MCA code) has been added to
detect the guilty device, but it's pretty incomplete.  If NMIs are a
persistent problem (on some model 70 or 80s, they occur every couple
shell commands), the CONFIG_IGNORE_NMI flag will take care of that.

Various Pentium machines have had serious problems with the FPU test in
bugs.h.  Basically, the machine hangs after the HLT test.  This occurs,
as far as we know, on the Pentium-equipped 85s, 95s, and some PC
Servers.  The PCI/MCA PC 750s are fine as far as I can tell.  The
``mca-pentium'' boot-prompt flag will disable the FPU bug check if this
is a problem with your machine.

The model 80 has a raft of problems that are just too weird and unique
to get into here.  Some people have no trouble while others have nothing
but problems.  I'd suspect some problems are related to the age of the
average 80 and accompanying hardware deterioration, although others
are definitely design problems with the hardware.  The problems include
SCSI controller problems, ESDI controller problems, and serious
screw-ups in the floppy controller.  Oh, and the parallel port is also
pretty flaky.  There were about 5 or 6 different model 80 motherboards
produced to fix various obscure problems.
As far as I know, it's pretty much impossible to tell which bugs a
particular model 80 has (other than triggering them, that is).

Drivers are required for some MCA memory adapters.  If you're suddenly
short a few megs of RAM, this might be the reason.  The (I think)
Enhanced Memory Adapter commonly found on the model 70 is one.  There's
a very alpha driver floating around, but it's pretty ugly (disassembled
from the DOS driver, actually).  See the MCA Linux web page (URL below)
for more current memory info.

The Thinkpad 700 and 720 will work, but various components are either
non-functional, flaky, or we don't know anything about them.  The
graphics controller is supposed to be some WD, but we can't get things
working properly.  The PCMCIA slots don't seem to work.  Ditto for APM.
The serial ports work, but detection seems to be flaky.

Credits
=======

A whole pile of people have contributed to the MCA code.  I'd include
their names here, but I don't have a list handy.  Check the MCA Linux
home page (URL below) for a perpetually out-of-date list.

=====================================================================
MCA Linux Home Page: http://www.dgmicro.com/mca/

Christophe Beauregard
chrisb@truespectra.com
cpbeaure@calum.csclub.uwaterloo.ca

=====================================================================
Appendix A: Sample /proc/mca

This is from my model 8595.  Slot 1 contains the standard IBM SCSI
adapter, slot 3 is an Adaptec AHA-1640, slot 5 is a XGA-1 video adapter,
and slot 7 is the 3c523 Etherlink/MC.

/proc/mca/machine:
Model Id: 0xf8
Submodel Id: 0x14
BIOS Revision: 0x5

/proc/mca/pos:
Slot 1: ff 8e f1 fc a0 ff ff ff  IBM SCSI Adapter w/Cache
Slot 2: ff ff ff ff ff ff ff ff
Slot 3: 1f 0f 81 3b bf b6 ff ff
Slot 4: ff ff ff ff ff ff ff ff
Slot 5: db 8f 1d 5e fd c0 00 00
Slot 6: ff ff ff ff ff ff ff ff
Slot 7: 42 60 ff 08 ff ff ff ff  3Com 3c523 Etherlink/MC
Slot 8: ff ff ff ff ff ff ff ff
Video : ff ff ff ff ff ff ff ff
SCSI  : ff ff ff ff ff ff ff ff

/proc/mca/slot1:
Slot: 1
Adapter Name: IBM SCSI Adapter w/Cache
Id: 8eff
Enabled: Yes
POS: ff 8e f1 fc a0 ff ff ff
Subsystem PUN: 7
Detected at boot: Yes

/proc/mca/slot3:
Slot: 3
Adapter Name: Unknown
Id: 0f1f
Enabled: Yes
POS: 1f 0f 81 3b bf b6 ff ff

/proc/mca/slot5:
Slot: 5
Adapter Name: Unknown
Id: 8fdb
Enabled: Yes
POS: db 8f 1d 5e fd c0 00 00

/proc/mca/slot7:
Slot: 7
Adapter Name: 3Com 3c523 Etherlink/MC
Id: 6042
Enabled: Yes
POS: 42 60 ff 08 ff ff ff ff
Revision: 0xe
IRQ: 9
IO Address: 0x3300-0x3308
Memory: 0xd8000-0xdbfff
Transceiver: External
Device: eth0
Hardware Address: 02 60 8c 45 c4 2a

Tools that manage md devices can be found at
   http://www.kernel.org/pub/linux/utils/raid/


Boot time assembly of RAID arrays
---------------------------------

You can boot with your md device with the following kernel command
lines:

for old raid arrays without persistent superblocks:
  md=<md device no.>,<raid level>,<chunk size factor>,<fault level>,dev0,dev1,...,devn

for raid arrays with persistent superblocks
  md=<md device no.>,dev0,dev1,...,devn
or, to assemble a partitionable array:
  md=d<md device no.>,dev0,dev1,...,devn

md device no. = the number of the md device ...
              0 means md0,
              1 md1,
              2 md2,
              3 md3,
              4 md4

raid level = -1 linear mode
              0 striped mode
              other modes are only supported with persistent super blocks

chunk size factor = (raid-0 and raid-1 only)
              Set the chunk size as 4k << n.

fault level = totally ignored

dev0-devn: e.g.
/dev/hda1,/dev/hdc1,/dev/sda1,/dev/sdb1

A possible loadlin line (Harald Hoyer) looks like this:

e:\loadlin\loadlin e:\zimage root=/dev/md0 md=0,0,4,0,/dev/hdb2,/dev/hdc3 ro


Boot time autodetection of RAID arrays
--------------------------------------

When md is compiled into the kernel (not as module), partitions of
type 0xfd are scanned and automatically assembled into RAID arrays.
This autodetection may be suppressed with the kernel parameter
"raid=noautodetect".  As of kernel 2.6.9, only drives with a type 0
superblock can be autodetected and run at boot time.

The kernel parameter "raid=partitionable" (or "raid=part") means
that all auto-detected arrays are assembled as partitionable.

Boot time assembly of degraded/dirty arrays
-------------------------------------------

If a raid5 or raid6 array is both dirty and degraded, it could have
undetectable data corruption.  This is because the fact that it is
'dirty' means that the parity cannot be trusted, and the fact that it
is degraded means that some datablocks are missing and cannot reliably
be reconstructed (due to no parity).

For this reason, md will normally refuse to start such an array.  This
requires the sysadmin to take action to explicitly start the array
despite possible corruption.  This is normally done with
   mdadm --assemble --force ....

This option is not really available if the array has the root
filesystem on it.  In order to support booting from such an
array, md supports a module parameter "start_dirty_degraded" which,
when set to 1, bypasses the checks and allows dirty degraded
arrays to be started.

So, to boot with a root filesystem of a dirty degraded raid[56], use

   md-mod.start_dirty_degraded=1


Superblock formats
------------------

The md driver can support a variety of different superblock formats.
Currently, it supports superblock formats "0.90.0" and the "md-1" format
introduced in the 2.5 development series.

The kernel will autodetect which format superblock is being used.

Superblock format '0' is treated differently to others for legacy
reasons - it is the original superblock format.


General Rules - apply for all superblock formats
------------------------------------------------

An array is 'created' by writing appropriate superblocks to all
devices.

It is 'assembled' by associating each of these devices with a
particular md virtual device.  Once it is completely assembled, it can
be accessed.

An array should be created by a user-space tool.  This will write
superblocks to all devices.  It will usually mark the array as
'unclean', or with some devices missing so that the kernel md driver
can create appropriate redundancy (copying in raid1, parity
calculation in raid4/5).

When an array is assembled, it is first initialized with the
SET_ARRAY_INFO ioctl.  This contains, in particular, a major and minor
version number.  The major version number selects which superblock
format is to be used.  The minor number might be used to tune handling
of the format, such as suggesting where on each device to look for the
superblock.

Then each device is added using the ADD_NEW_DISK ioctl.  This
provides, in particular, a major and minor number identifying the
device to add.

The array is started with the RUN_ARRAY ioctl.

Once started, new devices can be added.  They should have an
appropriate superblock written to them, and then be passed in with
ADD_NEW_DISK.

Devices that have failed or are not yet active can be detached from an
array using HOT_REMOVE_DISK.


Specific Rules that apply to format-0 super block arrays, and
       arrays with no superblock (non-persistent).
-------------------------------------------------------------

An array can be 'created' by describing the array (level, chunksize
etc) in a SET_ARRAY_INFO ioctl.  This must have major_version==0 and
raid_disks != 0.

Then uninitialized devices can be added with ADD_NEW_DISK.  The
structure passed to ADD_NEW_DISK must specify the state of the device
and its role in the array.

Once started with RUN_ARRAY, uninitialized spares can be added with
HOT_ADD_DISK.


MD devices in sysfs
-------------------
md devices appear in sysfs (/sys) as regular block devices,
e.g.
   /sys/block/md0

Each 'md' device will contain a subdirectory called 'md' which
contains further md-specific information about the device.

All md devices contain:
  level
     a text file indicating the 'raid level'.  e.g. raid0, raid1,
     raid5, linear, multipath, faulty.
     If no raid level has been set yet (array is still being
     assembled), the value will reflect whatever has been written
     to it, which may be a name like the above, or may be a number
     such as '0', '5', etc.

  raid_disks
     a text file with a simple number indicating the number of devices
     in a fully functional array.  If this is not yet known, the file
     will be empty.  If an array is being resized this will contain
     the new number of devices.
     Some raid levels allow this value to be set while the array is
     active.  This will reconfigure the array.  Otherwise it can only
     be set while assembling an array.
     A change to this attribute will not be permitted if it would
     reduce the size of the array.  To reduce the number of drives
     in an e.g. raid5, the array size must first be reduced by
     setting the 'array_size' attribute.

  chunk_size
     This is the size in bytes for 'chunks' and is only relevant to
     raid levels that involve striping (0,4,5,6,10).  The address space
     of the array is conceptually divided into chunks and consecutive
     chunks are striped onto neighbouring devices.
     The size should be at least PAGE_SIZE (4k) and should be a power
     of 2.  This can only be set while assembling an array.

  layout
     The "layout" for the array for the particular level.  This is
     simply a number that is interpreted differently by different
     levels.  It can be written while assembling an array.

  array_size
     This can be used to artificially constrain the available space in
     the array to be less than is actually available on the combined
     devices.
     Writing a number (in Kilobytes) which is less than
     the available size will set the size.  Any reconfiguration of the
     array (e.g. adding devices) will not cause the size to change.
     Writing the word 'default' will cause the effective size of the
     array to be whatever size is actually available based on
     'level', 'chunk_size' and 'component_size'.

     This can be used to reduce the size of the array before reducing
     the number of devices in a raid4/5/6, or to support external
     metadata formats which mandate such clipping.

  reshape_position
     This is either "none" or a sector number within the devices of
     the array where "reshape" is up to.  If this is set, the three
     attributes mentioned above (raid_disks, chunk_size, layout) can
     potentially have 2 values, an old and a new value.  If these
     values differ, reading the attribute returns
        new (old)
     and writing will affect the 'new' value, leaving the 'old'
     unchanged.

  component_size
     For arrays with data redundancy (i.e. not raid0, linear, faulty,
     multipath), all components must be the same size - or at least
     there must be a size that they all provide space for.  This is a
     key part of the geometry of the array.  It is measured in sectors
     and can be read from here.  Writing to this value may resize
     the array if the personality supports it (raid1, raid5, raid6),
     and if the component drives are large enough.

  metadata_version
     This indicates the format that is being used to record metadata
     about the array.  It can be 0.90 (traditional format), 1.0, 1.1,
     1.2 (newer format in varying locations) or "none" indicating that
     the kernel isn't managing metadata at all.
     Alternately it can be "external:" followed by a string which
     is set by user-space.  This indicates that metadata is managed
     by a user-space program.  Any device failure or other event that
     requires a metadata update will cause array activity to be
     suspended until the event is acknowledged.

  resync_start
     The point at which resync should start.
     If no resync is needed, this will be a very large number (or
     'none' since 2.6.30-rc1).  At array creation it will default
     to 0, though starting the array as 'clean' will set it much
     larger.

  new_dev
     This file can be written but not read.  The value written should
     be a block device number as major:minor.  e.g. 8:0
     This will cause that device to be attached to the array, if it is
     available.  It will then appear at md/dev-XXX (depending on the
     name of the device) and further configuration is then possible.

  safe_mode_delay
     When an md array has seen no write requests for a certain period
     of time, it will be marked as 'clean'.  When another write
     request arrives, the array is marked as 'dirty' before the write
     commences.  This is known as 'safe_mode'.
     The 'certain period' is controlled by this file which stores the
     period as a number of seconds.  The default is 200msec (0.200).
     Writing a value of 0 disables safemode.

  array_state
     This file contains a single word which describes the current
     state of the array.  In many cases, the state can be set by
     writing the word for the desired state, however some states
     cannot be explicitly set, and some transitions are not allowed.

     Select/poll works on this file.  All changes except between
     active_idle and active (which can be frequent and are not
     very interesting) are notified.  active->active_idle is
     reported if the metadata is externally managed.

     clear
         No devices, no size, no level
         Writing is equivalent to STOP_ARRAY ioctl
     inactive
         May have some settings, but array is not active
            all IO results in error
         When written, doesn't tear down array, but just stops it
     suspended (not supported yet)
         All IO requests will block. The array can be reconfigured.
         Writing this, if accepted, will block until array is quiescent
     readonly
         no resync can happen.  no superblocks get written.
         write requests fail
     read-auto
         like readonly, but behaves like 'clean' on a write request.

     clean - no pending writes, but otherwise active.
         When written to inactive array, starts without resync
         If a write request arrives then
           if metadata is known, mark 'dirty' and switch to 'active'.
           if not known, block and switch to write-pending
         If written to an active array that has pending writes, then fails.
     active
         fully active: IO and resync can be happening.
         When written to inactive array, starts with resync

     write-pending
         clean, but writes are blocked waiting for 'active' to be written.

     active-idle
         like active, but no writes have been seen for a while
         (safe_mode_delay).

  bitmap/location
     This indicates where the write-intent bitmap for the array is
     stored.
     It can be one of "none", "file" or "[+-]N".
     "file" may later be extended to "file:/file/name"
     "[+-]N" means that many sectors from the start of the metadata.
       This is replicated on all devices.  For arrays with externally
       managed metadata, the offset is from the beginning of the
       device.
  bitmap/chunksize
     The size, in bytes, of the chunk which will be represented by a
     single bit.  For RAID456, it is a portion of an individual
     device.  For RAID10, it is a portion of the array.  For RAID1, it
     is both (they come to the same thing).
  bitmap/time_base
     The time, in seconds, between looking for bits in the bitmap to
     be cleared.  In the current implementation, a bit will be cleared
     between 2 and 3 times "time_base" after all the covered blocks
     are known to be in-sync.
  bitmap/backlog
     When write-mostly devices are active in a RAID1, write requests
     to those devices proceed in the background - the filesystem (or
     other user of the device) does not have to wait for them.
     'backlog' sets a limit on the number of concurrent background
     writes.  If there are more than this, new writes will be
     synchronous.
  bitmap/metadata
     This can be either 'internal' or 'external'.
     'internal' is the default and means the metadata for the bitmap
     is stored in the first 256 bytes of the allocated space and is
     managed by the md module.
     'external' means that bitmap metadata is managed externally to
     the kernel (i.e. by some userspace program)
  bitmap/can_clear
     This is either 'true' or 'false'.  If 'true', then bits in the
     bitmap will be cleared when the corresponding blocks are thought
     to be in-sync.  If 'false', bits will never be cleared.
     This is automatically set to 'false' if a write happens on a
     degraded array, or if the array becomes degraded during a write.
     When metadata is managed externally, it should be set to true
     once the array becomes non-degraded, and this fact has been
     recorded in the metadata.


As component devices are added to an md array, they appear in the 'md'
directory as new directories named
      dev-XXX
where XXX is a name that the kernel knows for the device, e.g. hdb1.
Each directory contains:

      block
        a symlink to the block device in /sys/block, e.g.
             /sys/block/md0/md/dev-hdb1/block -> ../../../../block/hdb/hdb1

      super
        A file containing an image of the superblock read from, or
        written to, that device.

      state
        A file recording the current state of the device in the array
        which can be a comma separated list of
              faulty   - device has been kicked from active use due to
                         a detected fault, or it has unacknowledged bad
                         blocks
              in_sync  - device is a fully in-sync member of the array
              writemostly - device will only be subject to read
                         requests if there are no other options.
                         This applies only to raid1 arrays.
              blocked  - device has failed, and the failure hasn't been
                         acknowledged yet by the metadata handler.
                         Writes that would write to this device if
                         it were not faulty are blocked.
              spare    - device is working, but not a full member.
                         This includes spares that are in the process
                         of being recovered to
              write_error - device has ever seen a write error.
              want_replacement - device is (mostly) working but probably
                         should be replaced, either due to errors or
                         due to user request.
              replacement - device is a replacement for another active
                         device with same raid_disk.

        This list may grow in future.
        This can be written to.
        Writing "faulty" simulates a failure on the device.
        Writing "remove" removes the device from the array.
        Writing "writemostly" sets the writemostly flag.
        Writing "-writemostly" clears the writemostly flag.
        Writing "blocked" sets the "blocked" flag.
        Writing "-blocked" clears the "blocked" flag and allows writes
                to complete and possibly simulates an error.
        Writing "in_sync" sets the in_sync flag.
        Writing "write_error" sets writeerrorseen flag.
        Writing "-write_error" clears writeerrorseen flag.
        Writing "want_replacement" is allowed at any time except to a
                replacement device or a spare.  It sets the flag.
        Writing "-want_replacement" is allowed at any time.  It clears
                the flag.
        Writing "replacement" or "-replacement" is only allowed before
                starting the array.  It sets or clears the flag.

        This file responds to select/poll.  Any change to 'faulty'
        or 'blocked' causes an event.

      errors
        An approximate count of read errors that have been detected on
        this device but have not caused the device to be evicted from
        the array (either because they were corrected or because they
        happened while the array was read-only).  When using version-1
        metadata, this value persists across r