Applied Data Science - GitHub Pages

Applied Data Science
Ian Langmore
Daniel Krasner

Contents

I Programming Prerequisites
  1 Unix: History and Culture; The Shell; Streams; streams; Text; Philosophy; a nutshell; nuts and bolts; End Notes
  2 Version Control with Git: Background; What is Git; Setting Up; Online Materials; Basic Git Concepts; Common Git Workflows; Move from Working to Remote; changes in your working copy; changes; conflicts
  3 Building a Data Cleaning Pipeline with [...]: Simple Shell Scripts; Template for a Python CLI Utility

II The Classic Regression Models
  4 Notation for Structured Data
  5 Linear Regression: Introduction; Coefficient Estimation: Bayesian Formulation; setup; Gaussian World; Coefficient Estimation: Optimization Formulation; Least squares problem and the singular value decomposition; examples; the regularization parameter; techniques; Variable Scaling and Transformations; variable scaling; transformations of variables; transformations and segmentation; Error Metrics; End Notes
  6 Logistic Regression: Formulation; s viewpoint; viewpoint; generating viewpoint; Determining the regression coefficient w; Multinomial logistic regression; Logistic regression for classification; L1 regularization; Numerical solution; descent; s method; the L1 regularized problem; numerical issues; Model evaluation; End Notes
  7 Models Behaving [...]: End Notes

III Text Data
  8 Processing [...]: A Quick Introduction; Regular Expressions; Concepts; Command line and regular expressions; State Automata and PCRE; Python RE Module; The Python NLTK Library; NLTK Corpus and Some Fun things to do

IV Classification
  9 [...]: A Quick Introduction; Naive Bayes; Measuring Accuracy; metrics and ROC Curves; Other classifiers; Trees; Forest; classification; Entropy

V Extras
  10 High(er) performance [...]: Memory hierarchy; Parallelism; Practical performance in Python; Profiling; Standard Python rules of thumb; For loops versus BLAS; Multiprocessing Pools; Multiprocessing example: Stream processing text files; Numba; Cython

What is data science?

With the major technological advances of the last two decades, coupled in part with the internet explosion, a new breed of analyst has emerged. The exact role, background, and skill set of a data scientist are still in the process of being defined, and it is likely that by the time you read this some of what we say will seem [...]. In very general terms, we view a data scientist as an individual who uses current computational techniques to analyze data.

Now you might make the observation that there is nothing particularly novel in this, and subsequently ask what has forced the [...]. After all, statisticians, physicists, biologists, finance quants, etc. have been looking at data since their respective fields emerged. One short answer comes from the fact that the data sphere has changed and, hence, a new set of skills is required to navigate it effectively. The exponential increase in computational power has provided new means to investigate the ever-growing amount of data being collected every second of the day. What this implies is that any modern data analyst will have to make the time investment to learn the computational techniques necessary to deal with the volume and complexity of today's data. Like the skills of mathematics and statistics, these software skills transfer across domains, so it makes sense to create a job title that is also transferable. We could also point to the data hype created in industry as a culprit for the term data science, with the "science" creating an aura of validity and facilitating LinkedIn [...]

What skills are needed?

One neat way we like to visualize the data science skill set is with Drew Conway's Venn Diagram [Con]; see Figure 1.

Figure 1: Drew Conway's Venn Diagram

Math and statistics is what allows us to properly quantify a phenomenon observed in data. For the sake of narrative, let's take a complex deterministic situation, such as whether or not someone will make a loan payment, and attempt to answer this question with a limited number of variables and an imperfect understanding of those variables' influence on the event we wish to predict. With the exception of your friendly real estate agent, we generally acknowledge our lack of soothsayer ability and make statements about the probability of this event. These statements take a mathematical form, for example

$P[\text{makes-loan-payment}] = e^{\alpha + \beta \cdot [\ldots]}$

where the above quantifies the risk associated with this event.

[1] S. Cleveland decided to coin the term data science and write "Data Science: An action plan for expanding the technical areas of the field of statistics" [Cle]. His report outlined six points for a university to follow in developing a data analyst [...]

Deciding on the best coefficients $\alpha$ and $\beta$ can be done quite easily by a host of software packages. In fact, anyone with decent hacking skills can achieve the [...]. Of course, a simple model such as this would convince no one and would call for substantive expertise (more commonly called domain knowledge) to make real progress. In this case, a domain expert would note that additional variables such as the loan-to-value ratio and housing price index are needed, as they have a huge effect on payment activity. These variables and many others would allow us to arrive at a better model

$P[\text{makes-loan-payment}] = e^{\alpha + \beta \cdot X}. \quad (1)$

Finally we have arrived at a model capable of fooling someone! We could keep adding variables, and the model will almost certainly fit the historic risk quite well. BUT, how do we know that this will allow us to quantify risk in the future? To make some sense of our uncertainty [2] about our model we need to know exactly what (1) means. In particular, did we include too many variables and overfit?
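To make the overfitting question concrete, here is a minimal sketch on synthetic data. It swaps in ordinary least squares for the exponential model in the text (an assumption made purely for simplicity), and every name, sample size, and coefficient below is hypothetical. One variable truly drives the outcome; the others are pure noise.

```python
import numpy as np

# Synthetic data (all values are illustrative assumptions, not from the text).
rng = np.random.default_rng(0)
n_train, n_test, max_noise = 30, 1000, 25
n = n_train + n_test

x_true = rng.normal(size=n)                    # the one informative variable
noise_vars = rng.normal(size=(n, max_noise))   # irrelevant variables
y = 1.0 + 2.0 * x_true + rng.normal(size=n)    # outcome depends only on x_true


def train_test_mse(n_noise):
    """Fit least squares on the training rows using an intercept, x_true,
    and the first n_noise irrelevant variables; return (train, test) MSE."""
    X = np.column_stack([np.ones(n), x_true, noise_vars[:, :n_noise]])
    X_tr, y_tr = X[:n_train], y[:n_train]
    X_te, y_te = X[n_train:], y[n_train:]
    coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    mse_tr = np.mean((X_tr @ coef - y_tr) ** 2)
    mse_te = np.mean((X_te @ coef - y_te) ** 2)
    return mse_tr, mse_te


# Compare models with more and more irrelevant variables added.
results = {k: train_test_mse(k) for k in (0, 5, 15, 25)}
for k, (mse_tr, mse_te) in results.items():
    print(f"noise variables: {k:2d}  train MSE: {mse_tr:.3f}  test MSE: {mse_te:.3f}")
```

In-sample error can only fall as variables are added, which is why a good historical fit says little by itself. The error on held-out data is the more honest guide to how the model will behave in the future.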

Did our method of solving (1) arrive at a good solution or just numerical noise? Most importantly, how appropriate is the logistic regression model to begin with? Answering these questions is often as much an art as a science, but in our experience, sufficient mathematical understanding is necessary to avoid getting [...]

[2] The distinction between uncertainty and risk has been talked about quite extensively by Nassim Taleb [Tal05, Tal10].

What is the motivation for, and focus of, this course?

Just as common as the hacker with no domain knowledge, or the domain expert with no statistical know-how, is the traditional academic with meager computing skills. Academia rewards papers containing original theory. For the most part it does not reward the considerable effort needed to produce high-quality, maintainable code that can be used by others and integrated into larger frameworks. As a result, the type of code typically put forward by academics is completely unusable in industry, or by anyone else for that matter.

It is often not the purpose, or worth the effort, to write production-level code in an academic environment. The importance of this cannot be overstated. Consider a 20 person start-up that wishes to build a smartphone app that recommends restaurants to users. The data scientist hired for this job will need to interact with the company database (they will likely not be handed a neat CSV file), deal with falsely entered or inconveniently formatted data, and produce legible reports, as well as a working model for the rest of the company to integrate into its production framework. The scientist may be expected to do this work without much in the way of software support. Now, considering how easy it is to blindly run most predictive software, our hypothetical company will be tempted to use a programmer with no statistical knowledge to do this task. Of course, the programmer will fall into analytic traps such as the ones mentioned above, but that might not deter anyone from being content with the output. This anecdote seems contrived, but in reality it is something we have seen time and time again.

The current world of data analysis calls for a myriad of skills, and clean programming, database interaction, and an understanding of architecture have all become the minimum [...]

The purpose of this course is to take people with strong mathematical/statistical knowledge and teach them software development fundamentals [3]. This course will cover

- Design of small software packages
- Working in a Unix environment
- Designing software in teams
- Fundamental statistical algorithms such as linear and logistic regression
- Overfitting and how to avoid it
- Working with text data (regular expressions)
- Time series
- And more...

[3] Our view of what constitutes the necessary fundamentals is strongly influenced by the team at Software Carpentry [Wila].

Part I
Programming Prerequisites

Chapter 1
Unix

Simplicity is the key to brilliance
- Bruce [...]

History and Culture

The Unix operating system was developed in 1969 at AT&T's Bell Labs. Unix lives on through its open source offspring, Linux. This operating system is the dominant force in scientific computing, supercomputing, and web servers.

In addition, Mac OS X (which is Unix based) and a variety of user-friendly Linux operating systems represent a significant portion of the personal computer market. To understand the reasons for this success, some history is needed.

In the 1960s, MIT, AT&T Bell Labs, and General Electric developed a time-sharing (meaning different users could share one system) operating system called Multics. Multics was found to be too complicated. This failure led researchers to develop a new operating system that focused on simplicity. This operating system emphasized ease of communication among many simple programs. Kernighan and Pike summarized this as "the idea that the power of a system comes more from the relationships among programs than from the programs themselves."

The Unix community was integrated with the Internet and networked computing from the beginning. This, along with the solid fundamental design, could have led to Unix becoming the dominant computing paradigm during the 1980s personal computer revolution.

Figure: Ubuntu's GUI and CLI
