Introduction to Data Science, Version 2.0 is a free e-book written by Jeffrey Stanton, Professor and Senior Associate Dean in the School of Information Studies at Syracuse University, and Robert de Graaf. It was developed for the Certificate of Data Science program at the School, and serves as an introduction to the key concepts of data science for non-technical readers. The book uses a hands-on learning approach, walking readers through data science concepts and tasks using the R language. It is suitable for anyone who wants to have a slightly technical appreciation of what data science is about, and an understanding of what data scientists do.
The book is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license, and may be downloaded for free from jsresearch.net (22 MB non-interactive PDF) or from iTunes (interactive, for iBooks 3 or later).
The book is organised into an introduction followed by 17 chapters, each building on the concepts, techniques and code presented in the previous chapters. Here is an outline of the concepts taught in each:
- Data Science: Many Skills — Introduction, the role of data scientists, key activities in data science (data architecture, data acquisition, data analysis, and data archiving) and skills that data scientists need
- Chapter 1: About Data — Definition of data
- Chapter 2: Identifying Data Problems — Eliciting needs and data problems from business users
- Chapter 3: Getting Started with R — Installing R and using the R console
- Chapter 4: Follow the Data — Data modelling
- Chapter 5: Rows and Columns — Structured data, R data frames, ranges and quartiles
- Chapter 6: Beer, Farms and Peas — Descriptive statistics: measures of central tendency, variance, standard deviation and histograms
- Chapter 7: Sample in a Jar — Statistical sampling, Law of Large Numbers, Central Limit Theorem, normal distribution and statistical tests
- Chapter 8: Big Data? Big Deal! — Characteristics of Big Data and considerations for using all data versus sampling
- Chapter 9: Onward with R-Studio — Basic usage of RStudio and R functions
- Chapter 10: Tweet, Tweet! — Installing R packages, grabbing data from the Twitter API and data exploration
- Chapter 11: Popularity Contest — Poisson distribution, plotting charts in R and confidence intervals
- Chapter 12: String Theory — Text processing
- Chapter 13: Word Perfect — Text mining and word clouds
- Chapter 14: Storage Wars — Importing data from CSV, Excel, databases and Hadoop
- Chapter 15: MashUp — Geospatial analysis, geocoding and mapping
- Chapter 16: Line Up, Please (written by Robert de Graaf) — Linear models, linear regression
- Chapter 17: Hi Ho, Hi Ho – Data Mining We Go — Association rules mining using the Apriori algorithm
Stanton uses familiar, everyday examples to explain data science concepts in a manner that is easy to understand and follow, even for readers who have no prior knowledge of statistics or computer science. The book introduces each concept in the chapter where it is first needed, rather than grouping material by topic (for example, covering everything about the R language in one chapter). This approach has both advantages and disadvantages for learning. On the one hand, readers are not bombarded with too much information at the start, which makes for an easier transition from one chapter to the next. On the other hand, readers have to learn concepts from several fields at the same time, such as computer science, R programming and statistics, and this may become confusing by the last few chapters.
The hands-on examples in R are built on real data that can be downloaded from the Internet (for example, U.S. census data or Twitter feeds), which adds a sense of realism and makes the learning experience more engaging. Readers can put themselves in the shoes of working data scientists, who probably use the same data (and much more) and analyse it in similar ways.
This book is easy to read and is great for anybody who wants a hands-on, concrete introduction to the world of data science. It is useful for business people who want to understand how their data science teams can solve business challenges, and the processes and activities involved, provided they are willing to get their hands dirty with some coding and analysis. Aspiring data scientists who want a taste of what data science is like before committing more time and energy to it will also find this book useful.