0. Skim the paper describing how the dataset was built: https://www.danielpovey.com/files/2015_ … speech.pdf
1. Table 1
2. 2.2-2.3 Alignment
3. 2.4 Data Segmentation
4. Table 2
1. General Questions/Concerns
2. LibriSpeech Dataset
a. LibriVox v. Project Gutenberg
- https://librivox.org/
b. See the paper describing how the dataset was built: https://www.danielpovey.com/files/2015_ … speech.pdf
- See 2.4 Data Segmentation
- See Table 1
- See Table 2
- See 2.2-2.3 Alignment
2. Review of "Data Files"
a. splits
- train: `dev-clean`
- test: `test-clean`
- language model: `3-gram.pruned.3e-7.arpa`
b. audio
- 8 kHz v. 16 kHz v. flac
- roughly 20% drop in performance at 8 kHz compared with 16 kHz (https://www.superlectures.com/odyssey20 … on-systems)
- listen to some samples
- male v. female
- listen to some samples
c. segmented v. unsegmented
- split on "pause"
- pause = silence lasting more than X seconds
- silence = signal level never above Y dB
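
The pause rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used to build the dataset: the function names, the 10 ms frame size, and measuring level as RMS in dB relative to full scale are all assumptions, and X and Y appear as the hypothetical parameters `min_pause_s` and `silence_db`.

```python
import math

def frame_db(frame):
    """RMS level of a frame in dB relative to full scale (amplitude 1.0)."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def split_on_pauses(samples, sample_rate, min_pause_s, silence_db, frame_s=0.01):
    """Return (start, end) sample indices of segments separated by pauses.

    A frame is silent when its level is not above silence_db (the "Y" above);
    a pause is a silent run longer than min_pause_s seconds (the "X" above).
    """
    frame_len = max(1, round(sample_rate * frame_s))
    segments = []
    seg_start = None      # start of the segment currently being built
    voiced_end = None     # end of the last non-silent frame seen
    silent_frames = 0     # length of the current silent run, in frames
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if frame_db(frame) > silence_db:
            if seg_start is None:
                seg_start = start
            voiced_end = start + len(frame)
            silent_frames = 0
        else:
            silent_frames += 1
            silence_s = silent_frames * frame_len / sample_rate
            if seg_start is not None and silence_s > min_pause_s:
                segments.append((seg_start, voiced_end))
                seg_start = None
    if seg_start is not None:
        segments.append((seg_start, voiced_end))
    return segments
```

With a 1 kHz toy signal, a 0.3 s gap is treated as a pause when `min_pause_s=0.2`, while a 0.1 s gap is not, so the short gap stays inside one segment.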
d. phones
- see http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- silence phones
e. lexicon
- see http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- find stressed v. unstressed examples
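
To hunt for stressed v. unstressed examples, a small Python sketch like the one below may help. It assumes the CMUdict plain-text layout (a word, then its phones, with `(1)`-style markers on alternate pronunciations and `;;;` comment lines) and relies on the CMUdict convention that vowels end in a stress digit: 0 for no stress, 1 for primary, 2 for secondary. The three entries in `SAMPLE_LEXICON` are written in that style purely for illustration, and the function names are hypothetical.

```python
import re

# Illustrative entries in CMUdict style (not read from the real file).
SAMPLE_LEXICON = """\
RECORD  R AH0 K AO1 R D
RECORD(1)  R EH1 K ER0 D
SPEECH  S P IY1 CH
"""

def parse_lexicon(text):
    """Map each word to a list of pronunciations (each a list of phones)."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        word, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", word)  # drop a variant marker such as (1)
        lexicon.setdefault(word, []).append(phones)
    return lexicon

def stress_pattern(phones):
    """Return the stress digits carried by the vowels, in order."""
    return [p[-1] for p in phones if p[-1].isdigit()]

lexicon = parse_lexicon(SAMPLE_LEXICON)
for pron in lexicon["RECORD"]:
    print(" ".join(pron), stress_pattern(pron))
```

The two pronunciations of the sample word differ in which syllable carries primary stress, which is the kind of contrast worth listening for.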
3. out-of-vocabulary (OOV)
- see http://www.speech.cs.cmu.edu/tools/lextool.html
- see https://github.com/sequitur-g2p/sequitur-g2p
- (`tmux session=sequitur` on desktop for demo)
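
Before turning to a grapheme-to-phoneme tool, it helps to know which words are actually missing from the lexicon. A minimal sketch, assuming the transcript is already tokenized and the lexicon words are uppercase as in CMUdict (`find_oov` and the sample words are hypothetical):

```python
def find_oov(transcript_words, lexicon_words):
    """Return the transcript words missing from the lexicon, in first-seen order."""
    vocab = {w.upper() for w in lexicon_words}
    seen = set()
    oov = []
    for word in transcript_words:
        key = word.upper()  # compare case-insensitively
        if key not in vocab and key not in seen:
            seen.add(key)
            oov.append(key)
    return oov

words = "the speech of kaldify and the kaldify".split()
print(find_oov(words, ["THE", "SPEECH", "OF", "AND"]))
```

The resulting list is what would be handed to one of the tools linked above to obtain pronunciations.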
4. What to expect next week
- resources, see schedule: https://docs.google.com/document/d/1pXt … mWpKbIeyqc
- 2.1
- in shell using IRSTLM (manual in resource_files)
- using a toy corpus
- "real" language model will be built in Week 3
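
As a preview of what building a language model on a toy corpus involves, here is a rough Python sketch of n-gram counting with unsmoothed maximum-likelihood estimates. It is not IRSTLM and does not reproduce its output; the corpus, the function names, and the `<s>`/`</s>` padding are assumptions, and a real toolkit would typically add smoothing for unseen n-grams.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams, padding each sentence with <s> and </s> markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def history_counts(counts):
    """Total count of each (n-1)-word history across all n-grams."""
    hist = Counter()
    for ngram, c in counts.items():
        hist[ngram[:-1]] += c
    return hist

def mle_prob(ngram, counts, hist):
    """Unsmoothed estimate P(last word | history) = count(ngram) / count(history)."""
    total = hist[ngram[:-1]]
    return counts[ngram] / total if total else 0.0

toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
bigrams = ngram_counts(toy_corpus, 2)
hist = history_counts(bigrams)
print(mle_prob(("the", "cat"), bigrams, hist))
```

Here "the" is followed by "cat" in two of its three occurrences, so the estimate for that bigram is 2/3, while any bigram absent from the corpus gets probability 0, which is the gap smoothing is meant to fill.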
- 2.2
- in python
- "exploratory" with "case study"
- 2.HW
- due before Week 3 class
Last edited by Michael Capizzi (2018-01-17 01:20:44)