
NVIDIA DGX OS SERVER VERSION 3.1 - docs.nvidia.com


User Guide for instructions on how to install the software on air-gapped systems. ... As of the DGX OS Server 3.1.7 release, the Ubuntu kernel (4.4.0-124) is still subject to this issue. A later kernel version may resolve the issue, at which point an over-the-network



DA-08260-317_v04 | January 2019

Release Notes and Update Guide
NVIDIA DGX OS Server Version

TABLE OF CONTENTS

NVIDIA DGX OS Server Release Notes for Version
Update Advisement
About Release
Highlights and Changes in Version
Known Issues
Ubuntu / Linux Kernel Issues
DGX OS Server Software Content
Version Reference
Pascal
Volta (16 GB)
Volta (32 GB)
Updating to Version
Update Path Instructions
Connecting to the DGX-1 Console
Verifying the DGX-1 Connection to the Repositories
Updating from to
Update Instructions
Recovering from an Interrupted Update
Updating from to
Update Instructions
Verifying the nvidia-peer-memory Module

Enabling Dynamic DNS Updates
Recovering from an Interrupted Update
Updating from to
Update Instructions
Verifying the nvidia-peer-memory Module
Recovering from an Interrupted or Failed Update
Appendix A. Third Party License Notice

NVIDIA DGX OS SERVER RELEASE NOTES FOR VERSION

This document describes version of the NVIDIA DGX OS Server release software and update package.

DGX OS Server is provided as an over-the-network update, and requires an internet connection and the ability to access the NVIDIA public repository. See the chapter Updating to Version for instructions on performing the update.

If your DGX-1 is not connected to a network with internet access, refer to the DGX-1 User Guide for instructions on how to install the software on air-gapped systems.

UPDATE ADVISEMENT

DGX OS Server software
NVIDIA recommends that customers update the DGX OS Server software on their DGX-1 systems from version or to version. See the Highlights section for details of version.

NVIDIA Docker Containers
In conjunction with DGX OS Server, customers should update their NVIDIA Docker containers to the latest container release (see footnote 1).

Ubuntu Security Updates
Customers are responsible for keeping the DGX-1 up to date with the latest Ubuntu security updates using the apt-get upgrade procedure. See the Ubuntu Wiki Upgrades web page for more information.

ABOUT RELEASE

The following are the primary features of the DGX OS Server release:

- Supports DGX-1 using NVIDIA Pascal as well as Volta GPUs.

- Ubuntu LTS
  - Initialization daemon changed from Upstart to systemd.
  - Updated network interface naming policy. The policy now uses predictable names, rather than the native naming scheme used in previous releases. The first and second Ethernet interfaces, enumerated as em1 and em2 in previous releases, now enumerate as enp1s0f0 and enp1s0f1, respectively.
- NVIDIA GPU Driver Release 384
  - Supports the NVIDIA Tesla V100 GPUs.
  - Supports CUDA
- CUDA drivers and diagnostic packages updated to Release 384
- Mellanox drivers updated to
- Docker CE and the Docker Engine Utility for NVIDIA GPUs are pre-installed, and the Docker daemon is launched automatically.

1 See the NVIDIA Deep Learning Frameworks documentation website for information on the latest container releases as well as instructions for how to access them.
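Scripts or configuration written for earlier releases may still refer to em1 and em2. The following is a minimal sketch, not part of the NVIDIA release notes, of a helper that translates the two legacy names listed above and checks whether the new name exists on the current host. The function names map_iface and iface_exists are hypothetical; the /sys/class/net directory is the standard Linux location for network interface entries.

```shell
#!/bin/sh
# Translate a legacy interface name to the predictable name given in the
# release notes. Only em1 and em2 are mapped; other names pass through.
map_iface() {
    case "$1" in
        em1) echo enp1s0f0 ;;
        em2) echo enp1s0f1 ;;
        *)   echo "$1" ;;
    esac
}

# Succeed if the kernel currently exposes an interface with this name.
iface_exists() {
    [ -e "/sys/class/net/$1" ]
}

# Report the mapping for both legacy names on this host.
for old in em1 em2; do
    new=$(map_iface "$old")
    if iface_exists "$new"; then
        echo "$old -> $new (present)"
    else
        echo "$old -> $new (not present on this host)"
    fi
done
```

On a system other than a DGX-1 the new names will normally be reported as not present; the mapping itself is the only part taken from the release notes.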

HIGHLIGHTS AND CHANGES IN VERSION

- NVIDIA GPU Driver version
  - Includes security updates for driver components.
- Added support for updating to the NVIDIA Container Runtime for Docker
- Updated nvsysinfo to version
- Updated nvhealth to version

KNOWN ISSUES

- python3-distupgrade Error May Occur During DGX OS Update
- Script Cannot Recreate RAID Array After Re-inserting a Known Good SSD
- Software Power Cap Not Reported Correctly by nvidia-smi
- GPUs Cannot be Reset While the System is Running
- AppArmor Profile May not Work with Some Containers

python3-distupgrade Error May Occur During DGX OS Update

Issue
When updating to the latest DGX OS, errors may occur during the python3-distupgrade configuration stage, preventing a complete update.

Explanation and Workaround
The file /usr/lib/python3/dist-packages/ is either missing or renamed erroneously as the result of a previous update. If you encounter python3-distupgrade or python3-update-manager errors, perform the following to complete the update.

  $ sudo touch /usr/lib/python3/dist-packages/
  $ sudo apt -f install

Script Cannot Recreate RAID Array After Re-inserting a Known Good SSD

Issue
When a good SSD is removed from the DGX-1 RAID 0 array and then re-inserted, the script to recreate the array fails.

Workaround
After the SSD is re-inserted into the system, the RAID controller sets the array to offline and marks the re-inserted SSD as Unconfigured_Bad (UBad). The script fails when it attempts to rebuild an array in which one or more of the SSDs are marked UBad.

To recreate the array in this case,

1. Set the drive back to a good state.
   # sudo /opt/MegaRAID/storcli/storcli64 /c0/e<enclosure_id>/s<drive_slot> set good
2. Run the script to recreate the array.
   # sudo /usr/ -c -f

Software Power Cap Not Reported Correctly by nvidia-smi

Issue
On DGX-1 systems with Pascal GPUs, nvidia-smi does not report Software Power Cap as "Active" when clocks are throttled by power draw.

Workaround
This issue is with nvidia-smi reporting and not with the actual functionality. This will be fixed in a future release.

GPUs Cannot be Reset While the System is Running

Issue
You will not be able to reset the GPUs while the system is running.

Workaround
If an issue occurs which causes the GPUs to hang or if they need to be reset, you must reboot the system.

AppArmor Profile May not Work with Some Containers

Issue
AppArmor is enabled in this version of the DGX OS Server, with Docker generating a default profile. The default profile may or may not work with your containers.

Workaround
If there is a conflict with your containers, then either

- Disable AppArmor, or
- Provide a custom AppArmor profile and include it in the docker run command.

UBUNTU / LINUX KERNEL ISSUES

The following are known issues with either the Ubuntu OS or the Linux kernel that affect the DGX-1.

- System May Slow Down When Using mpirun
- FS-Cache Assertion Error Leading to System Panic May Occur
- Network Performance Drop

System May Slow Down When Using mpirun

Issue
Customers running Message Passing Interface (MPI) workloads may experience the OS becoming very slow to respond.

When this occurs, a log message similar to the following would appear in the kernel log:

  kernel BUG at /build/linux-fQ94 :1899!

Explanation
Due to the current design of the Linux kernel, the condition may be triggered when get_user_pages is used on a file that is on persistent storage. For example, this can happen when cudaHostRegister is used on a file path that is stored in an ext4 filesystem. DGX systems implement /tmp on a persistent ext4 filesystem.

Workaround
NOTE: If you performed this workaround on a previous DGX OS software version, you do not need to do it again after updating to the latest DGX OS version.

In order to avoid using persistent storage, MPI can be configured to use shared memory at /dev/shm (this is a temporary filesystem). If you are using Open MPI, then you can solve the issue by configuring the Modular Component Architecture (MCA) parameters so that mpirun uses the temporary file system in memory.
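As an illustration of the kind of MCA configuration described above, the sketch below sets an Open MPI parameter through the environment. It rests on an assumption that is not stated in these release notes: that orte_tmpdir_base (the session-directory base in ORTE-based Open MPI releases) is the relevant parameter. The helper name use_shm_tmpdir, the application ./my_app, and the rank count are hypothetical. The Knowledge Base article remains the authoritative procedure.

```shell
#!/bin/sh
# Assumption: orte_tmpdir_base is the MCA parameter that controls where
# Open MPI places its session directory (true for ORTE-based releases;
# check the documentation for the Open MPI version you run).

# Point Open MPI at an in-memory directory for this shell session.
# Open MPI reads MCA parameters from variables named OMPI_MCA_<param>.
# Usage: use_shm_tmpdir [DIR]   (DIR defaults to /dev/shm)
use_shm_tmpdir() {
    mpi_tmp=${1:-/dev/shm}
    if [ -d "$mpi_tmp" ] && [ -w "$mpi_tmp" ]; then
        OMPI_MCA_orte_tmpdir_base=$mpi_tmp
        export OMPI_MCA_orte_tmpdir_base
        return 0
    fi
    echo "warning: $mpi_tmp is not a writable directory" >&2
    return 1
}

# Example (not run here; ./my_app and the rank count are placeholders):
#   use_shm_tmpdir && mpirun -np 8 ./my_app
# Equivalent one-off form on the mpirun command line:
#   mpirun --mca orte_tmpdir_base /dev/shm -np 8 ./my_app
```

Setting the parameter through the environment affects only the current shell, which makes it easy to try before committing the value to a system-wide or per-user MCA parameter file.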

For details on how to accomplish this, see the Knowledge Base Article "DGX System Slows Down When Using mpirun" (requires login to NVIDIA Enterprise Support).

FS-Cache Assertion Error Leading to System Panic May Occur

Issue
An issue in the Linux kernel can, under some workloads, cause a kernel thread to crash due to an FS-Cache service assertion failure. This can be confirmed by examining the kernel logs (/var/ *).

Example:

  Mar 27 11:19:42 dev-dgx01 kernel: [ ] FS-Cache:
  Mar 27 11:19:42 dev-dgx01 kernel: [ ] FS-Cache: Assertion failed
  Mar 27 11:19:42 dev-dgx01 kernel: [ ] FS-Cache: 6 == 5 is false
  Mar 27 11:19:42 dev-dgx01 kernel: [ ] ------------[ cut here ]------------
  Mar 27 11:19:42 dev-dgx01 kernel: [ ] kernel BUG at /build/linux-3 :494!
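Checking the logs for this signature can be scripted. The sketch below is an illustration only: the function name count_fscache_asserts is hypothetical, and the log file paths are left as arguments because the path in the text above is incomplete. It searches for the "FS-Cache: Assertion failed" string shown in the example and reads plain-text files only, so compressed rotated logs would need to be decompressed first.

```shell
#!/bin/sh
# Print the number of lines, across all given log files, that report an
# FS-Cache assertion failure.
# Usage: count_fscache_asserts FILE...
count_fscache_asserts() {
    if [ "$#" -eq 0 ]; then
        echo "usage: count_fscache_asserts FILE..." >&2
        return 2
    fi
    # cat merges the inputs so that grep -c prints one total rather than
    # a count per file. grep exits non-zero when the count is 0, so
    # "|| true" keeps the function's exit status at 0 in that case.
    cat -- "$@" 2>/dev/null | grep -c 'FS-Cache: Assertion failed' || true
}

# Example (placeholder path; substitute the kernel log files on your system):
#   count_fscache_asserts /path/to/kernel-log
```

A result greater than zero suggests that the system has hit the assertion at least once in the period covered by the files examined.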

