Class Details

Price: $1,500

Course Includes:

  • Class exercises in addition to training instruction
  • Courseware books, notepads, pens, highlighters and other materials
  • Full breakfast with a variety of bagels, fruits, yogurt, doughnuts and juice
  • Tea, coffee, and soda available all day
  • Freshly baked cookies every afternoon (only at participating locations)

This four-day course teaches data scientists and analysts how to work with text data in R, how to classify documents and how to summarize bodies of text.

Course Outline

Module 1: Introduction to Text Mining

  • Lesson 1: Commercial applications of text mining
  • Lesson 2: Scraping data from the web
  • Lesson 3: Working with various APIs to retrieve text data
  • Lesson 4: Working with and storing text corpora, saving content and relevant metadata

Module 2: Cleaning Text Data

  • Lesson 1: Cleaning text: case conversion, punctuation removal, stemming, stop word removal, etc.
  • Lesson 2: Working with Term-Document/Document-Term matrices
  • Lesson 3: Text tokenization into n-grams and sentences

Module 3: Analyzing Text Data

  • Lesson 1: Bag of words: making word and n-gram clouds, comparison clouds and frequency bar charts
  • Lesson 2: Analyzing word and n-gram frequency distributions
  • Lesson 3: Application of bag of words: automatic text summarization using simplified and true versions of Luhn's algorithm

Module 4: Document Clustering, Classification and Topic Modeling

  • Lesson 1: Document clustering and pattern mining (hierarchical clustering, k-means clustering, etc.)
  • Lesson 2: Comparing and classifying documents using TF-IDF, Jaccard and cosine distance measures

Module 5: Identifying Important Text Elements

  • Lesson 1: Reducing dimensionality: Principal Component Analysis, Singular Value Decomposition and Non-negative Matrix Factorization
  • Lesson 2: Topic modeling and information retrieval using Latent Semantic Analysis

Module 6: Entity Extraction, Sentiment Analysis and Advanced Topic Modeling

  • Lesson 1: Positive vs. negative: degree of sentiment
  • Lesson 2: Item Response Theory
  • Lesson 3: Part-of-speech tagging and its application: finding people, places and organizations mentioned in text
  • Lesson 4: Advanced topic modeling: Latent Dirichlet Allocation


At the conclusion of this course, participants will be able to do the following:

  • Import, clean and parse various types of text data with R
  • Automatically summarize text
  • Identify key elements in the text data
  • Classify documents
  • Apply topic models to understand themes and measure sentiment