Transcription of Data Mining with Python (Working draft)
Data Mining with Python (Working draft)
Finn Arup Nielsen
November 29, 2017

Contents

Contents
List of Figures
List of Tables

Chapter 1
  Other introductions to Python?
  Why Python for data mining?
  Why not Python for data mining?
  Components of the Python language and software
  Developing and running Python
  [...], pypy, IPython
  [...] Notebook
  [...] 2 vs. Python 3
  [...] in the cloud
  Python in the browser

Chapter 2
  Basics
  Datatypes
  [...] (bool)
  [...] (int, float, complex and Decimal)
  [...] (str)
  [...] (dict)
  [...] and times
  [...] containers classes
  Functions and arguments
  [...] functions with lambdas
  [...] function arguments
  Object-oriented programming
  [...] as functions
  Modules and import
  [...] import
  [...] with Python 2/3 incompatibility
  Persistency
  [...] and JSON
  Documentation
  Testing
  [...] for type
  [...] testing
  Layout and test discovery
  [...] coverage
  [...] in different environments
  Profiling
  Coding style
  Where is private and public?
  Command-line interface scripting
  Distinguishing between module and script
  Argument parsing
  Exit status
  Debugging
  Logging
  Advices

Chapter 3 Python for data [...]
  Numpy
  Plotting
  [...] plotting
  [...] plotting
  [...] for the Web
  Pandas
  [...] data types
  [...] indexing
  [...] joining, merging and concatenations
  [...] statistics
  SciPy
  [...] transform
  Statsmodels
  Sympy
  Machine learning
  Text mining
  [...] expressions
  [...] from webpages
  [...] and part-of-speech tagging
  [...] detection
  [...] analysis
  Network mining
  Miscellaneous issues
  Lazy computation
  Testing data mining code

Chapter 4 Case: Pure Python matrix [...]
  Code listing

Chapter 5 Case: Pima data [...]
  Problem description and objectives
  Descriptive statistics and plotting
  Statistical tests
  Predicting diabetes type

Chapter 6 Case: Data mining a [...]
  Problem description and objectives
  Reading the data
  Graphical overview on the connections between the tables
  Statistics on the number of tracks sold

Chapter 7 Case: Twitter information [...]
  Problem description and objectives
  Building a news classifier

Chapter 8 Case: Big [...]
  Problem description and objectives
  Stream processing of JSON
  [...] processing of JSON Lines

Bibliography
Index

Preface

Python has grown to become one of the central languages in data mining, offering both a general programming language and libraries specifically targeted at numerical [...]. This book is continuously being written and grew out of a course given at the Technical University of [...].

List of Figures

The Python hierarchy.
Overview of methods and attributes in the common Python 2 built-in data types plotted as a formal concept analysis lattice graph.
Only a small subset of methods and attributes is shown.
Sklearn classes derivation.
Comorbidity for ICD-10 disease code (appendicitis).
Seaborn correlation plot on the Pima data set
Database tables graph

List of Tables

Basic built-in and Numpy and Pandas datatypes
Class methods and attributes
Testing concepts
Function for generation of Numpy data structures.
Some of the subpackages of SciPy.
Python machine learning packages
Scikit-learn methods
sklearn classifiers
Metacharacters and character classes
NLTK submodules.
Variables in the Pima data set

Chapter 1

Other introductions to Python?

Although we cover a bit of introductory Python programming in chapter 2, you should not regard this book as a Python introduction: several free introductory resources exist. First and foremost, there is the official Python Tutorial. Beginning programmers with little or no programming experience may want to look into Think Python, which is available as a book [1], while more experienced programmers can start with Dive Into Python. Sheppard's presently 381-page Introduction to Python for Econometrics, Statistics and Data Analysis covers both Python basics and Python-based data analysis with Numpy, SciPy, Matplotlib and Pandas, and it is not just relevant for econometrics [2].
Developers already well-versed in standard Python development but lacking experience with Python for data mining can begin with chapter 3. Readers in need of an introduction to machine learning may take a look at Marsland's Machine learning: An algorithmic perspective [3], which uses Python.

Why Python for data mining?

Researchers have noted a number of reasons for using Python in the data science area (data mining, scientific computing) [4, 5, 6]:

1. Programmers regard Python as a clear and simple language with high readability. Even non-programmers may not find it too difficult. The simplicity exists both in the language itself and in the encouragement to write clear and simple code that is prevalent among Python programmers. Contrast this with, e.g., Perl, where short-form variable names allow you to write condensed code but also require you to remember nonintuitive variable names.
A Python program may also be 2 to 5 times shorter than corresponding programs written in Java, C++ or C [7, 8].

2. Python will run on the three main desktop computing platforms, Mac, Linux and Windows, as well as on a number of other platforms.

3. With Python you get an interactive prompt with a REPL (read-eval-print loop) like in Matlab and R. The prompt facilitates exploratory programming, which is convenient for many data mining tasks, while you can still develop complete programs in an edit-run-debug cycle. The Python derivatives IPython and Jupyter Notebook are particularly suited for interactive use.

4. Python is a general purpose language that can be used for a wide variety of tasks beyond data mining, e.g., user applications, system administration, gaming, web development, and psychological experiment presentations and recording. This is in contrast to Matlab and R.

(For a further free website for learning Python, see [...].)

(To see how well Python with its modern data mining packages compares with R, take a look at Carl's blog posts on Will it Python? and his GitHub repository, where he reproduces R code in Python based on R data analyses from the book Machine Learning for [...].)

5. Python, with its BSD license, falls in the group of free and open source software. Although some large Python development environments may have associated license costs for commercial use, the basic Python development environment may be set up and run with no licensing cost. Indeed, on some systems, e.g., many Linux distributions, basic Python comes readily installed. The Python Package Index provides a large set of packages that are also free.

6. Python has a large community and has become more popular. Several indicators testify to this. The Popularity of Programming Language Index (PYPL) bases its programming language ranking on Google search volume provided by Google Trends and puts Python in third position after Java and PHP. According to PYPL, the popularity of Python has grown since 2004.
TIOBE constructs another indicator, putting Python at rank 6. This indicator is based on the number of skilled engineers world-wide, courses and third party vendors. Python is also among the leading programming languages on GitHub and in terms of StackOverflow tags. In 2014, Python was the most popular programming language at top-ranked United States universities for teaching introductory programming [9].

7. The Coverity company finds that Python code has [...] errors among its 400,000 lines of code, but that the error rate is very low compared to other open source software projects. They found [...] defects per KLoC [10].

8. With the browser-based interactive notebook, where code, textual and plotting results, and documentation may be interleaved in a cell-based environment, the Jupyter Notebook represents an interesting approach that you will typically not find in many other programming languages. Exceptions are the commercial systems Maple and Mathematica, which have notebooks. Notebooks run locally in a Web browser. The Notebook files are JSON files that can easily be shared and rendered on the Web. The obvious advantages of the Jupyter Notebook have led other languages to use it: the Jupyter Notebook can be changed to use, e.g., the Julia language as the computational backend, i.e., instead of writing Python code in the code cells of the notebook you write Julia code. With appropriate extensions the Jupyter Notebook can intermix R and Python.

Why not Python for data mining?

Why shouldn't you use Python?

Not well-suited to mobile phones and other portable devices. Although Python surely can run on mobile phones and there exists at least one (dated) book on Mobile Python [11], Python has not caught on for development of mobile apps. Several mobile app development frameworks exist, with Kivy mentioned as a leading contender.
Developers can also use Python in mobile contexts for the backend of a web-based system and for data mining the data collected there.

Does not run natively in the browser. JavaScript entirely dominates as the language in web browsers. Various ways exist to mix Python and web browser programming. The Pyjamas project, with its Python-to-JavaScript compiler, allows you to write web browser client code in Python and compile it to JavaScript, which the web browser then runs. Several other such stand-alone JavaScript compilers exist in "various states of development", as it is called: PythonJS, Pyjaco, Py2JS. Other frameworks use in-browser implementations, one of them being Brython, which enables the front-end engineer to write Python code in an HTML script tag if the page includes the Brython library via the HTML script tag. It supports core Python modules and has access to the DOM API, but not, e.g., the scientific Python libraries written in C.
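
The distinction between core Python modules and libraries written in C can be inspected from within Python itself. The following is a minimal sketch, not taken from the book, assuming CPython 3 and only the standard library; the helper name module_kind is hypothetical. It classifies a module by where the import system locates it, which hints at whether an in-browser implementation could load it as plain Python source.

```python
import importlib.machinery
import importlib.util


def module_kind(name):
    """Classify a module as pure Python, C extension, built-in or frozen.

    Hypothetical helper for illustration only: it looks at the origin
    reported by the import system for a top-level module name.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "not found"
    origin = spec.origin
    if origin == "built-in":
        # Compiled into the interpreter binary itself.
        return "built-in"
    if origin == "frozen":
        # Python bytecode bundled with the interpreter.
        return "frozen"
    if origin is None:
        # For example a namespace package without a single file.
        return "other"
    if origin.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES)):
        # Shared library (.so, .pyd): compiled code, not Python source.
        return "extension"
    if origin.endswith((".py", ".pyc")):
        return "pure-python"
    return "other"


# The result for compiled modules depends on platform and build,
# so no expected output is shown here.
for name in ("json", "_json", "math"):
    print(name, module_kind(name))
```

On a typical CPython installation json is reported as pure Python, while its accelerator _json and the math module are compiled code. Scientific packages usually combine Python source files with such compiled submodules, so checking only the top-level package name would not reveal the compiled parts.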