
Large Data Analysis with Python - PyTables

Francesc Alted
Freelance Developer and PyTables Creator
G-Node
November 24th, 2010. Munich, Germany



Transcription of Large Data Analysis with Python - PyTables

[Figure: Where do I live?]

Some Words About PyTables
- Started as a solo project back in 2002. I had to deal with very large amounts of data and needed to scratch my own itch.
- Focused on handling large series of tabular data:
  - Buffered I/O for maximum [...]
  - [...] fast selections through the use of [...]
  - [...] indexing for top-class performance [...]
- [...] from selling PyTables Pro sponsors part of my invested time.

[Figure: Some PyTables Users]

Outline
1 The Starving CPU Problem
    Getting the Most Out of Computers
    Caches and Data Locality
    Techniques For Fighting Data Starvation
2 High Performance Libraries
    Why Should You Use Them?
    In-Core High Performance Libraries
    Out-of-Core High Performance Libraries

1 The Starving CPU Problem

Getting the Most Out of Computers: An Easy Goal?
Computers nowadays are very powerful:
- Extremely fast CPUs (multicores)
- Large amounts of RAM
- Huge disk capacities
But they are facing a pervasive problem:
- An ever-increasing mismatch between CPU, memory and disk speeds (the so-called "Starving CPU problem")

This introduces tremendous difficulties in getting the most out of computers.

Once Upon a Time...
- In the 1970s and 1980s the memory subsystem was able to deliver all the data that processors required in time.
- In the good old days, the processor was the key bottleneck.
- But in the 1990s things started to change...

[Figure: CPU vs Memory Cycle Trend]

The CPU Starvation Problem
Known facts (in 2010):
- Memory latency is much higher (around 250x) than processor cycle times, and it has been an essential bottleneck for the past [...].
- Memory throughput is improving at a better rate than memory latency, but it is also much slower than processors (about 25x).

The result is that CPUs in our current computers are suffering from a serious data starvation problem: they could consume (much!) more data than the system can possibly deliver.

Caches and Data Locality

What Is the Industry Doing to Alleviate CPU Starvation?
- They are improving memory throughput: cheap to implement (more data is transmitted on each clock cycle).

- They are adding big caches in the CPU dies.

Why Is a Cache Useful?
- Caches are closer to the processor (normally in the same die), so both the latency and throughput are improved.
- However: the faster they run, the smaller they must be.
- They are effective mainly in a couple of scenarios:
  - Time locality: when the dataset is reused.
  - Spatial locality: when the dataset is accessed sequentially.

Time Locality
Parts of the dataset are reused.

Spatial Locality
The dataset is accessed sequentially.
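The two locality scenarios can be made concrete with a toy model. The sketch below is not from the slides: it simulates a tiny least-recently-used cache in pure Python, and the names `hit_rate`, `n_lines` and `line_size` are hypothetical. Real CPU caches are far more elaborate, so treat the numbers only as an illustration of why reused and sequential accesses get served from the cache while scattered ones do not.

```python
from collections import OrderedDict

def hit_rate(addresses, n_lines=4, line_size=8):
    """Fraction of accesses served by a toy LRU cache holding n_lines lines."""
    cache = OrderedDict()              # keys are line numbers, oldest first
    hits = 0
    for addr in addresses:
        line = addr // line_size       # neighbouring addresses share a line
        if line in cache:
            hits += 1
            cache.move_to_end(line)    # mark the line as most recently used
        else:
            cache[line] = True
            if len(cache) > n_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)

sequential = list(range(64))             # spatial locality: walk adjacent items
reused = [0, 100, 200] * 20              # time locality: same items over and over
scattered = [i * 8 for i in range(64)]   # every access lands on a new line

for name, trace in [("sequential", sequential),
                    ("reused", reused),
                    ("scattered", scattered)]:
    print(name, hit_rate(trace))
```

In this model the sequential trace misses only on the first item of each line, the reused trace misses only on its first pass, and the scattered trace never hits.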

Techniques For Fighting Data Starvation

The Blocking Technique
When you have to access memory, get a contiguous block that fits in the CPU cache, operate upon it or reuse it as much as possible, then write the block back to memory.

Understand NumPy Memory Layout
With a being a square array (4000x4000) of doubles, we have:

Summing up column-wise:
    a[:,1].sum()   # takes [...] ms
Summing up row-wise: more than 100x faster (!)
    a[1,:].sum()   # takes 72 µs
Remember: NumPy arrays are ordered row-wise (C convention).

Vectorize Your Code
Naive matrix-matrix multiplication: 1264 s (1000x1000 doubles)

    import numpy as np

    def dot_naive(a, b):  # [...] MFlops
        nrows, ncols = a.shape[0], b.shape[1]
        c = np.zeros((nrows, ncols), dtype='f8')
        for row in range(nrows):
            for col in range(ncols):
                for i in range(a.shape[1]):  # inner dimension
                    c[row, col] += a[row, i] * b[i, col]
        return c

Vectorized matrix-matrix multiplication: 20 s (64x faster)

    def dot(a, b):  # 100 MFlops
        nrows, ncols = a.shape[0], b.shape[1]
        c = np.zeros((nrows, ncols), dtype='f8')
        for row in range(nrows):
            for col in range(ncols):
                # one vectorized multiply-and-sum replaces the inner loop
                c[row, col] = np.sum(a[row] * b[:, col])
        return c
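The blocking technique described in this section can be sketched in NumPy as follows. This is an illustration under assumptions, not code from the talk: the function name `blocked_poly`, the polynomial being evaluated, and the default `block_size` of 32,768 doubles (256 KiB) are all hypothetical, and the right block size depends on the cache of the machine at hand.

```python
import numpy as np

def blocked_poly(x, block_size=32_768):
    """Evaluate 3*x**2 + 2*x + 1 one block at a time (x is a 1-D array)."""
    out = np.empty_like(x)
    for start in range(0, x.shape[0], block_size):
        stop = start + block_size
        block = x[start:stop]      # contiguous view of the input, no copy
        tmp = block * block        # temporaries are block-sized, not array-sized
        tmp *= 3.0
        tmp += 2.0 * block         # the block is reused while it is still cached
        tmp += 1.0
        out[start:stop] = tmp      # write the finished block back to memory
    return out

x = np.linspace(0.0, 1.0, 100_000)
blocked = blocked_poly(x)
```

The point of the design is that every intermediate result is only as large as one block, so the working set can stay in cache instead of streaming whole-array temporaries through main memory. Whether this is actually faster than the plain expression depends on the array size and the hardware.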

The Consequences of the Starving CPU Problem
- The gap between CPU and memory speed is simply huge (and growing).
- Over time, an increasing number of applications will be affected by memory access.
Fortunately, hardware manufacturers are creating novel solutions for fighting CPU starvation!
But vendors cannot solve the problem alone:
- Scientists need another way to look at their computers: data arrangement, not code itself, is central to program [...].
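To see why data arrangement matters, the row-wise versus column-wise comparison from the NumPy memory layout slide can be reproduced with a short script. This is a sketch rather than the author's benchmark: it uses a smaller 2000x2000 array than the slide's 4000x4000, the variable names are illustrative, and the timings it prints will vary from machine to machine.

```python
import numpy as np
from timeit import timeit

a = np.zeros((2000, 2000), dtype="f8")   # C order: each row is contiguous

row = a[1, :]   # consecutive elements are 8 bytes apart
col = a[:, 1]   # consecutive elements are one full row (16000 bytes) apart

print("row strides:", row.strides)
print("col strides:", col.strides)

# Time both reductions; only the memory access pattern differs.
t_row = timeit(row.sum, number=200)
t_col = timeit(col.sum, number=200)
print(f"row-wise sum: {t_row:.4f} s, column-wise sum: {t_col:.4f} s")
```

Both views hold the same number of doubles, so any difference in the measured times comes from how far apart the elements sit in memory.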

2 High Performance Libraries

Why High Performance Libraries?
- High performance libraries are made by people who know the different optimization techniques very well.
- You may be tempted to create original algorithms that can be faster than these, but in general, it is very difficult to beat them.
- In some cases, it may take some time to get used to them, but the effort pays off in the long run.

NumPy: A Powerful Data Container for Python
NumPy provides a very powerful, object-oriented, multi-dimensional data container:
- array[index]: retrieves a portion of a data container
- (array1**3 / array2) - sin(array3): evaluates potentially complex expressions
- numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions
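The three capabilities listed for NumPy can be tried on small arrays. The values below are made up for illustration, and `portion`, `expr` and `product` are hypothetical names. Whether `np.dot` actually reaches an optimized BLAS depends on how the local NumPy build was configured.

```python
import numpy as np

array1 = np.arange(1.0, 7.0).reshape(2, 3)   # [[1, 2, 3], [4, 5, 6]]
array2 = np.full((2, 3), 2.0)
array3 = np.zeros((2, 3))

# array[index]: retrieve a portion of the container (a view, not a copy)
portion = array1[0, 1:]

# Element-wise evaluation of a compound expression
expr = (array1**3 / array2) - np.sin(array3)

# Matrix product; for 2-D float arrays this is the operation BLAS GEMM provides
product = np.dot(array1, array2.T)

print(portion)
print(expr)
print(product)
```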

[Figure: NumPy: The Cornerstone of Python Numerical Apps]

In-Core High Performance Libraries

Some In-Core High Performance Libraries
- ATLAS/MKL (Intel's Math Kernel Library): Uses memory-efficient algorithms as well as SIMD and multi-core algorithms for linear algebra.
- VML (Intel's Vector Math Library): Uses SIMD and multi-core to compute basic math functions (sin, cos, exp, ...) in [...].
- [...]: Performs potentially complex operations with NumPy arrays without the overhead of temporaries.
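The last entry mentions operating on NumPy arrays without the overhead of temporaries. The sketch below illustrates that cost using only plain NumPy and the `out=` argument of its ufuncs. It is a stand-in technique, not the API of the library the slide refers to, and the names `naive`, `result` and `scratch` are illustrative.

```python
import numpy as np

a = np.linspace(0.0, 1.0, 100_000)
b = np.linspace(1.0, 2.0, 100_000)

# Plain expression: NumPy allocates a new full-size array for 2*a,
# another for 3*b, and a third for their sum.
naive = 2.0 * a + 3.0 * b

# Same arithmetic with two preallocated buffers that can be reused
# across many evaluations instead of being allocated every time.
result = np.empty_like(a)
scratch = np.empty_like(a)
np.multiply(a, 2.0, out=result)
np.multiply(b, 3.0, out=scratch)
np.add(result, scratch, out=result)
```

Writing expressions this way quickly becomes unreadable, which is one reason to prefer a library that handles the buffering automatically.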

