For Chunking, Named Entity Extraction, POS Tagging:- CRF++, HMM
CRF
- Feature Functions in a CRF
In a CRF, each feature function takes as input:
- a sentence s
- the position i of a word in the sentence
- the label $l_i$ of the current word
- the label $l_{i-1}$ of the previous word
and outputs a real-valued number (though in practice often just 0 or 1).
Assign each feature function $f_j$ a weight $\lambda_j$ (the reference below discusses how these weights are learned from data). Given a sentence s, we can now score a labeling l of s by adding up the weighted features over all words in the sentence:
$$score(l|s) = \sum_{j=1}^m \sum_{i=1}^n \lambda_j f_j(s, i, l_i, l_{i-1})$$
(The first sum runs over each feature function j, and the inner sum runs over each position i of the sentence.)
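A minimal sketch of this scoring in Python, with two hand-written feature functions and made-up weights (both are illustrative assumptions; real weights are learned from data):

```python
# Two toy binary feature functions of the form f(s, i, l_i, l_{i-1}).

def f1(sentence, i, label, prev_label):
    # Fires when an adjective label is followed by a noun label.
    return 1.0 if prev_label == "ADJ" and label == "NOUN" else 0.0

def f2(sentence, i, label, prev_label):
    # Fires when a word ending in "-ly" is labeled as an adverb.
    return 1.0 if sentence[i].endswith("ly") and label == "ADV" else 0.0

FEATURES = [f1, f2]
WEIGHTS = [0.8, 1.2]  # the lambda_j; made up here, learned in practice

def score(labels, sentence):
    """Sum of weighted feature functions over all positions (the two
    sums in the formula above). START is a dummy previous label."""
    total = 0.0
    for i in range(len(sentence)):
        prev = labels[i - 1] if i > 0 else "START"
        for lam, f in zip(WEIGHTS, FEATURES):
            total += lam * f(sentence, i, labels[i], prev)
    return total

sentence = ["she", "ran", "quickly"]
print(score(["PRON", "VERB", "ADV"], sentence))   # 1.2
print(score(["PRON", "VERB", "NOUN"], sentence))  # 0.0
```

Exponentiating and normalizing these scores over all possible labelings turns them into probabilities.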
Ref: A very good post: Introduction to Conditional Random Fields
HMM (Hidden Markov model)
$$p(l|s) \propto \prod_{i=1}^n p(l_i|l_{i-1})\, p(w_i|l_i)$$
$p(l_i|l_{i-1})$ are transition probabilities (e.g., the probability that a preposition is followed by a noun);
$p(w_i|l_i)$ are emission probabilities (e.g., the probability that a noun emits the word “dad”).
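A toy illustration of this factorization; the transition and emission tables below are made-up numbers, not estimates from a corpus:

```python
# Minimal HMM joint score for one tagging of one sentence.
TRANS = {("START", "DET"): 0.6, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.5}
EMIT = {("DET", "the"): 0.7, ("NOUN", "dad"): 0.1, ("VERB", "runs"): 0.2}

def joint_prob(labels, words):
    """p(l, s) = prod_i p(l_i | l_{i-1}) * p(w_i | l_i); START is a dummy tag."""
    p = 1.0
    prev = "START"
    for label, word in zip(labels, words):
        p *= TRANS.get((prev, label), 0.0) * EMIT.get((label, word), 0.0)
        prev = label
    return p

print(joint_prob(["DET", "NOUN", "VERB"], ["the", "dad", "runs"]))
# 0.6*0.7 * 0.9*0.1 * 0.5*0.2 = 0.00378
```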
Background knowledge:
- A Markov chain is “a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.”
- the term Markov property refers to the memoryless property of a stochastic process.
- The hidden Markov model can be represented as the simplest dynamic Bayesian network.
- In simpler Markov models (like a Markov chain), the state is directly visible to the observer, so the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but the output (a data point or “token”), which depends on the state, is visible. Each state has a probability distribution over the possible output tokens, so the sequence of tokens generated by an HMM gives some information about the sequence of states.
Named Entity Extraction
Full named-entity recognition is broken down into two distinct problems: detection of names, and classification of the names by the type of entity they refer to (e.g., person, organization, location, or other). The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that “Bank of America” is a single name, disregarding the fact that inside this name the substring “America” is itself a name.
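BIO tagging is one common way to cast this segmentation: B- opens a span, I- continues it, O marks tokens outside any name. A minimal decoder (the tag names and example are illustrative):

```python
def bio_to_spans(tags):
    """Return (start, end_exclusive, entity_type) for each contiguous span."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        # an "I-" tag continues the current span; no action needed
    return spans

# "Bank of America" is one ORG span; the inner "America" is not nested.
print(bio_to_spans(["B-ORG", "I-ORG", "I-ORG", "O"]))  # [(0, 3, 'ORG')]
```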
Concepts:
- Chunking: an analysis of a sentence which first identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and then links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.).
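A small chunking illustration using NLTK's regular-expression chunker (assumes nltk is installed; the noun-phrase grammar is a toy pattern, not a production one):

```python
import nltk

# NP = optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)

tagged = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ")]
print(chunker.parse(tagged))
# (S (NP the/DT quick/JJ fox/NN) jumps/VBZ)
```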
Approaches
Measurement
Problems of token-level accuracy:
- the vast majority of tokens in real-world text are not part of entity names as usually defined, so the baseline accuracy (always predict “not an entity”) is extravagantly high, typically >90%
- mispredicting the full span of an entity name is not properly penalized (finding only a person’s first name when their last name follows is scored as ½ accuracy).
Instead, a variant of the F1 score computed over entity spans is used:
- Precision is the number of predicted entity name spans that line up exactly with spans in the gold standard evaluation data. I.e. when [Person Hans] [Person Blick] is predicted but [Person Hans Blick] was required, precision for the predicted name is zero. Precision is then averaged over all predicted entity names.
- Recall is similarly the number of names in the gold standard that appear at exactly the same location in the predictions.
- F1 score is the harmonic mean of these two.
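A sketch of this exact-match scoring, assuming gold and predicted entities are given as (start, end, type) tuples (e.g., from the BIO decoder above):

```python
def span_f1(gold, pred):
    """Exact-match precision, recall, and F1 over entity spans."""
    gold_set, pred_set = set(gold), set(pred)
    hits = len(gold_set & pred_set)  # spans matching exactly, type included
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Predicting [Person Hans][Person Blick] for gold [Person Hans Blick]:
gold = [(0, 2, "PER")]
pred = [(0, 1, "PER"), (1, 2, "PER")]
print(span_f1(gold, pred))  # (0.0, 0.0, 0.0) -- no exact span match
```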
POS Tagging
Word Alignment in Machine Translation:- MaxEnt
Spell Checker:- Edit Distance (sketch after this list), Soundex
Parsing:- CKY algorithm and other chart parsing algorithms
Document Classification:- SVM, Naive Bayes
Anaphora Resolution:- Hobbs algorithm, Lappin and Leass algorithm, Centering Theory
Topic Modeling and keyword extraction:- LDA, LSI
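For the edit-distance part of spell checking, the classic Levenshtein dynamic program (a standard sketch, not tied to any particular library):

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    turning string a into string b."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete from a
                           dp[i][j - 1] + 1,         # insert into a
                           dp[i - 1][j - 1] + cost)  # substitute / match
    return dp[m][n]

print(edit_distance("speling", "spelling"))  # 1 (one insertion)
```

A spell checker ranks candidate corrections by this distance (often capped at 1 or 2 edits) from the misspelled word.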