Transcription of KTLS: Linux Kernel Transport Layer Security
Dave Watson
Facebook
San Francisco

Transport Layer Security (TLS) is a widely-deployed protocol used for securing TCP connections on the Internet. TLS is also a required feature for HTTP/2, the latest web standard. Kernel implementations provide new opportunities for optimization of TLS. This paper explores a possible kernel TLS implementation, as well as the kernel features it enables, such as sendfile(), BPF programs, and hardware TLS offload. Our implementation saves up to 7% of CPU in copy overhead and improves latency by up to 10% when combined with the Kernel Connection Multiplexor (KCM).

Keywords
TLS, DTLS, Linux, Security, performance, sockets, OpenSSL, offload

Introduction
Transport Layer Security [2] (TLS) and Datagram Transport Layer Security (DTLS) are building blocks for transport security on the modern internet. The latest version of the Hypertext Transfer Protocol [1] (HTTP/2) specifies the use of TLS.
It provides both encryption and authentication of TCP connections, but comes with a CPU cost. TLS and DTLS consist of two primary operations: first, a TLS handshake is performed to negotiate a secure symmetric encryption algorithm and keys; then TLS symmetric encryption is performed on TLS records. TLS has several types of records, including data records and control records.

DTLS is a UDP-based encryption protocol. Most elements of TLS are reused, with minor changes to support stateless datagram messages. KTLS supports DTLS messages, and implements a sliding window for replay protection.

Facebook encrypts the majority of its external traffic over HTTPS. Internal traffic is also encrypted if there is enough available CPU. Internal traffic is served over Apache Thrift [6], Facebook's RPC framework, and is also encrypted. Facebook's HTTP/2 web servers and RPC servers both function similarly.
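The sliding window for DTLS replay protection mentioned above can be illustrated with a small sketch. This is not the KTLS code: it assumes a 64-entry bitmap window of the kind commonly used for DTLS and IPsec anti-replay, and the names replay_window and replay_check_and_update are hypothetical. A real implementation would normally check the window before authenticating a record and update it only afterwards; the sketch merges both steps for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

#define REPLAY_WINDOW_BITS 64

/* Hypothetical per-connection replay state for this sketch. */
struct replay_window {
    uint64_t highest;  /* highest sequence number accepted so far */
    uint64_t bitmap;   /* bit i set: sequence (highest - i) was seen */
    bool     started;  /* false until the first record is accepted */
};

/* Returns true and records seq if it is new; false if replayed or too old. */
bool replay_check_and_update(struct replay_window *w, uint64_t seq)
{
    if (!w->started) {
        w->started = true;
        w->highest = seq;
        w->bitmap = 1;
        return true;
    }

    if (seq > w->highest) {
        uint64_t shift = seq - w->highest;
        /* Shifting a 64-bit value by 64 or more is undefined in C. */
        w->bitmap = shift >= REPLAY_WINDOW_BITS ? 0 : w->bitmap << shift;
        w->bitmap |= 1;
        w->highest = seq;
        return true;
    }

    uint64_t offset = w->highest - seq;
    if (offset >= REPLAY_WINDOW_BITS)
        return false;  /* older than the left edge of the window */

    uint64_t mask = (uint64_t)1 << offset;
    if (w->bitmap & mask)
        return false;  /* already seen: replay */

    w->bitmap |= mask;
    return true;
}
```

Out-of-order datagrams inside the window are accepted once; anything older than 64 sequence numbers behind the newest accepted record is rejected outright.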
One thread per core is dedicated to an epoll() event loop. When epoll_wait() returns a list of active connections, read() and write() are used to read in the request and then send the static or dynamic response. In a TLS-enabled service, OpenSSL's SSL_read and SSL_write primitives are used instead.

Figure 1: Standard web server with OpenSSL

Since OpenSSL is a user space library, all data must be in user space to be encrypted. On machines that make heavy use of TLS, Facebook's current SSL overheads result in approximately 2% of total CPU spent on copies from and to user space due to encryption, and approximately 10% of total CPU spent on encryption and decryption routines.

To eliminate the overhead due to copies, Facebook has investigated using the sendfile() or splice() system calls to send static content directly from disk to the network, without any copies through user space.
Unfortunately, due to our wide deployment of TLS, this hasn't been possible, because the data must be encrypted in user space. Facebook has also experimented with the Kernel Connection Multiplexor [3] (KCM), and found reductions in tail latencies. Unfortunately, KCM would require access to the unencrypted bytes in the kernel.

Encrypting the data in the kernel results in ideally zero copies, with only the encryption taking the bulk of the CPU usage. User space only needs to inform the kernel of which data needs to be encrypted. The Linux kernel has an existing crypto interface, AF_ALG, that can be used to do bulk encryption, but additional overhead is required to add the framing from user space. It also lacks an efficient interface to NIC hardware encryption.

Figure 2: Server using KTLS and sendfile

Approach
Facebook, in collaboration with Red Hat, has implemented a Linux kernel TLS socket.
To avoid putting unnecessary complexity in the kernel, the TLS handshake is kept in user space. A full TLS connection using the socket is done using the following scheme:

1. Call connect() or accept() on a standard TCP file descriptor.
2. Use a user space TLS library to complete a handshake. We have tested with both GnuTLS and OpenSSL.
3. Create a new KTLS socket file descriptor.
4. Extract the TLS Initialization Vectors (IVs), session keys, and sequence IDs from the TLS library. Use setsockopt on the KTLS fd to pass them to the kernel.
5. Use standard read(), write(), sendfile() and splice() system calls on the KTLS fd.

On receipt of a non-data TLS message (a control message), the KTLS socket returns an error, and the message is instead left on the original TCP socket. The KTLS socket is automatically detached. Transfer of control back to the original encrypted FD is done by calling getsockopt to receive the current sequence numbers, and inserting them into the TLS library.
Example:

    if (read(..) < 0) {
        /* Fetch the current receive sequence number from the kernel. */
        getsockopt(tls_fd, AF_KTLS, KTLS_GET_IV_RECV,
                   ssl->s3->read_sequence, &optlen);
        /* Similar for IV_SEND */
        SSL_read(tcp_fd, ..);
    }

In this scheme, the TLS library is used to handle the control messages and do the handshake, and does not need to be modified. It can maintain control of the original TCP fd, while unencrypted data flows through the KTLS socket. The user space application only needs to handle application data, and use standard socket system calls.

Most of the complexity in this scheme is the buffer management between the two FDs, and handing off control when control messages are received. It is reasonable to not handle most control messages (Facebook's servers shut down the connection on receipt of a control message), but the client sending the control message is still expecting a response. To enable a clean shutdown, control must still be passed back to the original TLS handshake library to send the appropriate response.
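The hand-off decision described above depends on telling data records apart from control records. The following is a minimal sketch of that classification, not the KTLS implementation. It assumes the standard 5-byte TLS record header (content type, two version bytes, two length bytes in big endian) and the content type values from RFC 5246; the names classify_record and record_kind are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* TLS record content types (RFC 5246, section 6.2.1). */
enum tls_content_type {
    TLS_CHANGE_CIPHER_SPEC = 20,
    TLS_ALERT              = 21,
    TLS_HANDSHAKE          = 22,
    TLS_APPLICATION_DATA   = 23
};

#define TLS_HEADER_LEN 5

/* Hypothetical result codes for this sketch. */
enum record_kind { RECORD_INCOMPLETE, RECORD_DATA, RECORD_CONTROL };

/*
 * Inspect the first record in buf[0..len). For RECORD_DATA and
 * RECORD_CONTROL, *record_len is set to the record size, header included.
 */
enum record_kind classify_record(const uint8_t *buf, size_t len,
                                 size_t *record_len)
{
    if (len < TLS_HEADER_LEN)
        return RECORD_INCOMPLETE;

    /* Bytes 3 and 4 hold the payload length, big endian. */
    size_t payload_len = ((size_t)buf[3] << 8) | buf[4];
    if (len < TLS_HEADER_LEN + payload_len)
        return RECORD_INCOMPLETE;

    *record_len = TLS_HEADER_LEN + payload_len;
    return buf[0] == TLS_APPLICATION_DATA ? RECORD_DATA : RECORD_CONTROL;
}
```

In the scheme above, a RECORD_DATA result would be decrypted and delivered through the KTLS socket, while a RECORD_CONTROL result would be left on the TCP socket for the user space TLS library.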
To help manage this complexity, the strparser [9] library was developed to manage parsing TCP buffers as datagram messages.

Crypto Framework Changes
The Linux crypto framework already contains the two symmetric ciphers included in the TLS draft, GCM-AES and ChaCha/Poly. Current Intel chipsets support the AES-NI instruction set, which allows fast encryption and decryption routines using GCM-AES. These routines were already implemented in assembly for IPsec; however, they required minor modifications to work with TLS. TLS's AAD data is 13 bytes, while IPsec uses 16 bytes. An additional template was added to support the correct AAD size. The asm routines still require a full 16 bytes for [...].

The current AES-NI crypto interface requires all parts of the message to be contiguous, including the AAD and tag data. This presents a slight performance hit on both send and receive. For send, we can reuse almost the same AAD every time, and don't need to reallocate space for it on encrypt.
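To make the 13-byte AAD concrete, here is a minimal sketch of how it could be built. It assumes the TLS 1.2 AEAD layout from RFC 5246, which matches the size quoted above; the function name tls12_build_aad is hypothetical and is not taken from the KTLS module.

```c
#include <stdint.h>

#define TLS_AAD_LEN 13

/*
 * Fill aad[0..13) for one TLS 1.2 AEAD record (RFC 5246, section 6.2.3.3):
 * 8-byte sequence number, 1-byte content type, 2-byte protocol version,
 * 2-byte plaintext length, all big endian.
 */
void tls12_build_aad(uint8_t aad[TLS_AAD_LEN], uint64_t seq,
                     uint8_t content_type, uint16_t version,
                     uint16_t plaintext_len)
{
    for (int i = 0; i < 8; i++)
        aad[i] = (uint8_t)(seq >> (56 - 8 * i));
    aad[8]  = content_type;
    aad[9]  = (uint8_t)(version >> 8);
    aad[10] = (uint8_t)(version & 0xff);
    aad[11] = (uint8_t)(plaintext_len >> 8);
    aad[12] = (uint8_t)(plaintext_len & 0xff);
}
```

Under this layout, only the sequence number and the length change between consecutive data records on a connection, which is consistent with reusing almost the same AAD buffer on every send.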
On receive, we want to strip both the AAD and tag data before passing it to user space. Minor changes to the interface can be made to support separate locations for the AAD, user data, and tag.

Kernel Connection Multiplexor
Facebook's primary motivation was to gain access to the unencrypted bytes in kernel space. KCM is used to decode the framing, and make intelligent scheduling choices, before sending the frames to user space. KTLS sockets are mapped 1:N to user space sockets, where N is the number of user space threads, which are usually mapped to cores. Using this scheme, KTLS + KCM is able to reduce the total number of thread migrations of an individual [...].

We have implemented the KTLS Linux kernel module, and run it in production. KTLS encryption speed is on par with user space encryption speed. The number of active file descriptors increased, due to using an FD for both the TCP socket and the KTLS socket.
The services ran without any change in functionality, and with only minor service code changes, for the duration of the test. A KCM socket scheme, as described above, was run on top of the KTLS socket for the service. KCM was updated to be able to directly attach to a KTLS socket. KCM results in a 10%-20% drop in the 99th percentile latency for the service. See Figure 3.

Figure 3: KTLS + KCM 99th percentile latency (green) vs. OpenSSL (blue) in ms

We also tested the performance of using sendfile(). The base scheme of read() followed by write() of static data from a file to a socket was benchmarked, followed by several other schemes.

Scheme           Percent normalized CPU time
SSL              100
mmap             97
sendfile         93
mmap+vmsplice    97

Figure 4: Various schemes to send files from disk, normalized to read(file) + SSL_write(tcp fd)

SSL: Using OpenSSL, read() a file into a user space buffer and SSL_write() it to a tcp fd.
mmap: mmaping a file, then calling SSL_write.
sendfile: Calling sendfile on a KTLS fd.
mmap+vmsplice: mmaping file data, using OpenSSL to frame and encrypt it, then calling vmsplice() to send the data to a tcp fd.

splice() calls can be used in place of sendfile().
Testing configuration was an Intel(R) Xeon(R) CPU E5-2660 [...]. The test was a send of a 2GB data file from disk, with CPU usage normalized over ten runs. The sender used the above schemes; the receiver used [...].

A rough breakdown of top CPU usage for the kernel sendfile scheme:

    _encrypt_by_8_new
    _encrypt_by_8
    _get_AAD_loop2_done196
    tls_sendpage

and for mmap + SSL_write:

    gcm_ghash_clmul
    aesni_ctr32_encrypt_blocks
    copy_user_enhanced_fast_string
    tcp_sendmsg
    filemap_map_pages

Utilizing vmsplice, we were able to remove the copy from user space to the kernel's TCP buffers, but it was replaced by VM page management.

KTLS enables access to the unencrypted bytes in the kernel. The majority of the code is related to buffer management and translation from the message-oriented TLS protocol to the stream-oriented BSD socket interface.