NLP - Natural Language Processing
Tokenization
First step in NLP: Tokenize words!
A text (a collection of sentences made of words) has to be broken down into tokens:
- Word tokens (e.g. splitting the sentence at space characters)
- Character tokens: school → s + c + h + o + o + l
- Subword tokens: likelihood → like + li + hood
Vocabulary refers to the set of unique tokens in the corpus.
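A minimal sketch of the three token types in plain Python (the subword split is hard-coded here just for illustration; real subword tokenizers like BPE or WordPiece learn their splits from data):

```python
sentence = "The likelihood of rain is high"

# Word tokens: split at whitespace
word_tokens = sentence.split()           # ['The', 'likelihood', 'of', 'rain', 'is', 'high']

# Character tokens: every single character becomes a token
char_tokens = list("school")             # ['s', 'c', 'h', 'o', 'o', 'l']

# Subword tokens: hard-coded here for illustration; real subword
# tokenizers (BPE, WordPiece, ...) learn these splits from a corpus
subword_tokens = ["like", "li", "hood"]  # "likelihood" -> like + li + hood

# The vocabulary is the set of unique tokens in the corpus
vocabulary = set(word_tokens)
print(sorted(vocabulary))
```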
Word embeddings
Definition
Word embeddings are vectors of real numbers, created by mapping each word or phrase of a vocabulary (built e.g. from a corpus, a large collection of texts) to such a vector.
Examples
spacy.io offers larger or smaller word embeddings.
Here, for example, are small, medium and large word embeddings; in the small one, some words simply won’t be included.
This Colab is from a workshop by responsibly.ai that I attended in November 2019.
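As a small sketch (assuming the medium English model has been downloaded via `python -m spacy download en_core_web_md`), this is roughly how you get at those vectors in spaCy:

```python
import spacy

# The medium/large models ship with word vectors; the small one does not.
nlp = spacy.load("en_core_web_md")

token = nlp("school")[0]
print(token.has_vector)    # True if the word is in the model's vector table
print(token.vector.shape)  # e.g. (300,)
print(token.is_oov)        # True for words missing from the vocabulary
```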
Different kinds of word embeddings
Bag of words
Each word that occurs in a sentence gets a number that represents how often it occurs in that sentence.
Pro: Slim, simple representation
Con: Word order is lost
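A minimal bag-of-words sketch in plain Python (scikit-learn’s `CountVectorizer` does the same thing at scale):

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat"]

# Fixed vocabulary over the whole corpus, one dimension per unique word
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})

def bag_of_words(sentence):
    counts = Counter(sentence.split())
    # Each position holds how often that vocabulary word occurs in the sentence
    return [counts[word] for word in vocabulary]

print(vocabulary)                              # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(bag_of_words("the cat sat on the mat"))  # [1, 0, 1, 1, 1, 2]
# Word order is gone: any permutation of the sentence gives the same vector.
```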
One-hot encoding
Pro: Order is preserved
Con: No relation between words, a lot of vectors, a lot of zeros!
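A minimal one-hot sketch over the same kind of vocabulary; a sentence then becomes a sequence of these vectors, which keeps the word order but nothing else:

```python
vocabulary = ["cat", "dog", "mat", "on", "sat", "the"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # One dimension per vocabulary word: a single 1, everything else 0
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

print(one_hot("cat"))  # [1, 0, 0, 0, 0, 0]
print(one_hot("dog"))  # [0, 1, 0, 0, 0, 0]
# Every pair of different words is equally far apart:
# the vectors encode no relation between "cat" and "dog".
```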
Vectors - relating words
Pro: Relations between words are encoded in an array of length 300, learned from a corpus.
- An illustration of Word2Vec
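A hedged sketch using gensim’s downloader with a pre-trained GloVe model (my choice here, roughly 130 MB and 100-dimensional; the well-known pre-trained word2vec vectors are 300-dimensional, matching the length-300 arrays mentioned above):

```python
import gensim.downloader as api

# Downloads a pre-trained 100-dimensional GloVe model on first use (~130 MB)
wv = api.load("glove-wiki-gigaword-100")

# Related words end up close together in the vector space
print(wv.similarity("school", "university"))  # relatively high
print(wv.similarity("school", "banana"))      # relatively low

# The classic analogy: king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```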
Gender Bias in word embeddings
AI and bias in recruiting applications
Unfortunately, on a broad level there is still discrimination of various sorts when it comes to the selection of candidates - even if there are some positive examples of recruitment companies that actively work against it.
Because every artificial intelligence system is based on historic data, every AI in the field of recruiting, for example an automated CV screening system, is inherently full of #biases. So AI will only amplify such discriminatory bias, not diminish it.
There are, however, #machinelearning methods to overcome the gender bias or racial bias of the data the algorithms are trained on. A few months ago I mentioned it in a different discussion here on LinkedIn: https://tinyurl.com/1ilvyktu
Shlomi Hod (he/him) offers a workshop ( https://learn.responsibly.ai/word-embedding/ ) in which he first demonstrates the gender bias inherent in natural language processing systems (a subfield of AI) and then shows how it can be mitigated.
Since there are ways to alter the results of a machine learning algorithm to be less #biased, the question of whether to implement such an alteration becomes a social and political problem and is no longer solely a technical one.
Just as you say, AI can only be as “intelligent” as the data of our past and present allows. That data is naturally shaped by biases which we as a society want to move away from, not reinforce.
About a year ago I took part in a wonderful data science workshop, “Exploring Gender Bias in Word Embeddings” by Shlomi Hod ( https://learn.responsibly.ai/word-embedding/ ), where you see hands-on how current societal biases are simply amplified by AI, but it is also shown how to counteract such a bias in order to create the kind of “good” human intelligence you speak of. His workshop is aimed at everyone (with or without a technical background) and I can warmly recommend it for gaining an intuitive understanding of gender bias in artificial intelligence. On the homepage you can find his upcoming workshop dates, or read through the interactive code yourself: https://colab.research.google.com/github/ResponsiblyAI/word-embedding/blob/master/tutorial-bias-word-embedding.ipynb
- Shlomi Hod gives the workshop “Workshop // Exploring Gender Bias in Word Embeddings”. This is the corresponding GitHub repo, the Colab (Jupyter) notebook and the workshop homepage.
- Here’s another article about gender bias in word embeddings.
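As a rough illustration of the kind of probe the workshop explores (this is not its actual methodology; the workshop builds on the `responsibly` package), you can compare how close profession words sit to “he” versus “she” in a pre-trained model:

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # any pre-trained embedding works

# A crude gender-bias probe: which pronoun is each profession closer to?
for word in ["doctor", "nurse", "engineer", "teacher"]:
    to_he = wv.similarity(word, "he")
    to_she = wv.similarity(word, "she")
    print(f"{word:10s}  he: {to_he:.3f}  she: {to_she:.3f}")

# Biased embeddings also complete analogies in a stereotyped way,
# e.g. man : doctor :: woman : ?
print(wv.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3))
```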
Sentiment analysis