Free eBook: Applied Data Science (Columbia University)


Published in 2013, but still very interesting and different from most data science books. Authors: Ian Langmore and Daniel Krasner. The book leans toward the statistics side of data science, while also getting readers started with basic programming and command-line skills. It does not, however, cover much of what you would expect from the machine learning side.

Source for picture: check page 68 in the book.
You can download the book here. For other related books, check out our recommended reading list.
Contents
I Programming Prerequisites 
1 Unix 

History and Culture
The Shell
Streams
Standard streams
Pipes
Text
Philosophy
In a nutshell
More nuts and bolts
End Notes

2 Version Control with Git 

Background
What is Git
Setting Up
Online Materials
Basic Git Concepts
Common Git Workflows
Linear Move from Working to Remote
Discarding changes in your working copy
Erasing changes
Remotes
Merge conflicts

3 Building a Data Cleaning Pipeline with Python

Simple Shell Scripts
Template for a Python CLI Utility

II The Classic Regression Models
4 Notation

Notation for Structured Data

5 Linear Regression

Introduction
Coefficient Estimation: Bayesian Formulation
Generic setup
Ideal Gaussian World
Coefficient Estimation: Optimization Formulation
The least squares problem and the singular value decomposition
Overfitting examples
L2 regularization
Choosing the regularization parameter
Numerical techniques
Variable Scaling and Transformations
Simple variable scaling
Linear transformations of variables
Nonlinear transformations and segmentation
Error Metrics
End Notes

6 Logistic Regression

Formulation
Presenter’s viewpoint
Classical viewpoint
Data generating viewpoint
Determining the regression coefficient w
Multinomial logistic regression
Logistic regression for classification
L1 regularization
Numerical solution
Gradient descent
Newton’s method
Solving the L1 regularized problem
Common numerical issues
Model evaluation
End Notes

7 Models Behaving Well

End Notes

III Text Data
8 Processing Text

A Quick Introduction
Regular Expressions
Basic Concepts
Unix Command line and regular expressions
Finite State Automata and PCRE
Backreference
Python RE Module
The Python NLTK Library
The NLTK Corpus and Some Fun things to do

IV Classification
9 Classification

Quick Introduction
Naive Bayes
Smoothing
Measuring Accuracy
Error metrics and ROC Curves
Other classifiers
Decision Trees
Random Forest
Out-of-bag classification
Maximum Entropy

V Extras
10 High(er) performance Python 

Memory hierarchy
Parallelism
Practical performance in Python
Profiling
Standard Python rules of thumb
For loops versus BLAS
Multiprocessing Pools
Multiprocessing example: Stream processing text files
Numba
Cython
