Articker Technology
Keywords: Media-Index, Named-Entity Recognition (NER), Text Matching, NLP, Name Disambiguation, Deep Learning, Text Classification
Articker is the art world’s analytics engine. It is built on top of the largest index of millions of online publications covering the visual arts over the past decades. Access to the Articker database and Articker’s real-time analytics is currently exclusive to Phillips specialists.
Media Index
Articker’s key concept is the media index. The media index measures each artist’s presence in the media over time. It is a function of several parameters: a) the weight of an artist’s mention in an article, b) the weight of the publisher of the article, c) the length of the article and d) the type of article. Is it a review or just information about an exhibition? The weight of an artist’s mention reflects the level of centrality of an artist to a given article. For example, does an article feature the artist? Or is the reference shallow, possibly listing the artist among many others? An artist name appearing in an article’s headline signifies the deepest relevance of an artist’s mention. An artist may also be central to the article without appearing in the headline. To determine the level of relevance of an artist, we use a range of techniques from information retrieval to Natural Language Processing (NLP).
The weight of publishers reflects how respected and referenced they are in the art literature. Articles by highly weighted domains such as The New York Times, The Guardian, Hyperallergic, and Artnet News count much more, for example, than university newspapers like the Daily Bruin or the Yale Daily News. Finally, the article types are determined by our classification system which is continuously adjusted using machine learning methods.
The media index of an artist is the sum of scores from all of the articles they’ve been mentioned in.
The media share of an artist is defined as the ratio of the artist’s media index to the sum of the media index values of all artists. Media share is analogous to the well-known business concept of market share.
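The two definitions above can be sketched in a few lines of Python. This is only an illustration: the article records are hypothetical, and the per-article score is assumed to be a simple product of mention weight and publisher weight, whereas the actual function also depends on article length and type.

```python
from collections import defaultdict

# Hypothetical records: (artist, mention_weight, publisher_weight).
# The product used as the per-article score is an assumption for illustration.
mentions = [
    ("Artist A", 1.0, 10.0),  # headline mention in a highly weighted outlet
    ("Artist A", 0.2, 2.0),   # shallow mention in a minor outlet
    ("Artist B", 0.5, 10.0),
    ("Artist C", 1.0, 2.0),
]

def media_index(mentions):
    """Sum the per-article scores for each artist."""
    index = defaultdict(float)
    for artist, mention_weight, publisher_weight in mentions:
        index[artist] += mention_weight * publisher_weight
    return dict(index)

def media_share(index):
    """Divide each artist's media index by the total over all artists."""
    total = sum(index.values())
    return {artist: value / total for artist, value in index.items()}

index = media_index(mentions)
share = media_share(index)
```

By construction the shares of all artists sum to one, which is what makes media share comparable to market share.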
Our media index is one of many possible functions which could successfully measure the artist’s media presence. The following properties have to be preserved:
- The more respected publishers should have higher weight
- The more central presence of an artist in an article should be rewarded by a higher media presence score for this article and this artist
Aging of the media index function, which would penalize older publications, can also be an important option. It is not yet implemented in the Articker system, but it remains an option we can consider.
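One possible form of such aging, sketched here purely as an assumption since it is not part of Articker, is exponential decay of an article’s score with its age. The function name and the half-life parameter are hypothetical.

```python
def aged_score(score, age_days, half_life_days=365.0):
    """Discount an article score by its age.

    half_life_days is an assumed parameter: after that many days the
    article contributes half of its original score.
    """
    return score * 0.5 ** (age_days / half_life_days)

fresh = aged_score(10.0, age_days=0)
one_year_old = aged_score(10.0, age_days=365)
five_years_old = aged_score(10.0, age_days=5 * 365)
```

With aging of this kind the media index would no longer be purely cumulative: an artist without new coverage would see the index decline over time.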
Plot 1 shows the distribution of the media index of the top 1000 artists. We can see the classical power law distribution. Picasso has the highest media index of over 480,000, followed by Andy Warhol, Van Gogh and Banksy, all of them with a media index exceeding 200,000. They are followed by Leonardo Da Vinci, Rembrandt, Ai Weiwei, Damien Hirst, Basquiat and Jeff Koons, all with a media index above 100,000.
The top 100 artists have media index values above 20,000 and the top 1000 above 3,500, as shown in Plot 1.
Plot 2 shows the distribution of the top K% of media indexes for all artists. Here we can see the classical long tail. For example, the top 10% of all artists have a media index above 380 and the top 20% above 150. The median index score is around 30. There are around 7000 artists with a media index value between 25 and 35! A media index of 35 may be the result of an artist being mentioned in up to 60 publications or in 16 headlines in more minor publications, which is still quite a significant presence. A media index of 30 is several orders of magnitude smaller than the media index of blue chip artists in the top 100, as seen in Plot 1. Blue chip artists may appear in tens, if not hundreds, of thousands of articles and up to 10,000 headlines.
Now let us discuss media-share gain.
Plot 3 is the distribution of the 90-day media-share gain for artists with a media index score near 500.
Media-share gain reflects the change in an artist’s media share relative to other artists. This change can be positive (the artist is gaining media share) or negative (the artist is losing media share). Notice that while the media index never goes down due to its cumulative nature, the media share may go up and down. If the media index of an artist does not keep pace with the total media index of all artists, the artist is considered to be lagging behind and the gain turns into a loss.
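A small worked example makes the last point concrete. The sketch below assumes that gain is measured as the relative change in media share between two snapshots (for instance 90 days apart); the numbers are hypothetical.

```python
def media_share_gain(artist_before, artist_after, total_before, total_after):
    """Relative change in media share between two snapshots."""
    share_before = artist_before / total_before
    share_after = artist_after / total_after
    return (share_after - share_before) / share_before

# Hypothetical snapshot: the artist's index grows by 2% (500 -> 510)
# while the total media index of all artists grows by 5%.
lagging = media_share_gain(500.0, 510.0, 1_000_000.0, 1_050_000.0)

# An artist growing exactly as fast as the total has zero gain.
keeping_pace = media_share_gain(500.0, 525.0, 1_000_000.0, 1_050_000.0)
```

In the first case the media index went up, yet the gain is negative because the rest of the field grew faster.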
It is generally harder for artists with large media indexes, i.e. top artists, to achieve high gains in media share. The lower the media index of an artist, the easier it is to gain media share.
Plot 3 shows the distribution of media share gains for artists with a media index around 490 (i.e. between 480 and 500), the media index value of Danielle McKinney, the leader of our Top 10 Momentum lists for both June and July. Again, the distribution of these 90-day gains follows a power law. We can see that her gains of 150% and 270% respectively are the highest in that cohort of artists.
Plot 4 shows the cumulative media share of the top K artists for all values of K=1…1000. As expected from the power law distribution of media index values (Plots 1 and 2), the media share of the top 1000 artists reaches almost 50% (46% to be exact). The media share of the top 100 artists is almost 21% and the media share of the top 10 artists is nearly 8%. Thus, a small number of top artists contributes a disproportionately high media share. Interestingly, even though the median of all media index values is around 35, the cumulative media share of all artists with a media index value above 35 is around 99%. Thus, the lower-ranked half of all the artists (the long tail) contributes only 1% of the media share.
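The curve in Plot 4 is a running sum over the ranked media index values. A minimal sketch, using made-up and deliberately skewed values rather than Articker data:

```python
from itertools import accumulate

def cumulative_top_k_share(index_values):
    """Element k-1 of the result is the combined media share of the top k artists."""
    ranked = sorted(index_values, reverse=True)
    total = sum(ranked)
    return [running / total for running in accumulate(ranked)]

# Hypothetical media index values for ten artists (they sum to 1000).
values = [500, 250, 125, 60, 30, 15, 10, 5, 3, 2]
curve = cumulative_top_k_share(values)

# Share contributed by the lower-ranked half of the artists.
lower_half_share = 1.0 - curve[len(values) // 2 - 1]
```

Even in this tiny example the top artist alone holds half of the media share, while the lower-ranked half of the artists holds only a few percent.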
Top momentum lists
Every month we publish the list of top 10 artists by momentum. Momentum is defined as the gain of media share over the past 90 days. Some artists continue their momentum over consecutive months. For example, Danielle McKinney has led the top momentum lists for both June and July. The lists for two consecutive months differ in that the later list adds the latest 30 days and drops the first 30 days of the earlier list. Thus, for the July list, the month of March is dropped and replaced by the month of July (from June 15 to July 15). This penalizes artists whose major media coverage took place in March but did not continue through late June and early July. These lists truly reflect momentum, and staying on consecutive lists requires prolonged momentum. This is why in our momentum rankings we normalize the gains by effectively penalizing larger gains obtained from a smaller “base” media index value.
Our monthly lists simply identify artists who are currently breaking out, relative to their prior media index value.
Articker involves many state-of-the-art technologies such as named-entity recognition (NER), name disambiguation, content classification, relevance and ranking. Articker matches hundreds of thousands of artist names against millions of online documents to construct media indexes for each artist [29]. The media index reflects each artist’s media presence, which depends on the weights of the individual domains as well as the relevance of each article to an artist. Is the article truly featuring the given artist or is it just a shallow reference? We will enumerate and discuss the main challenges Articker faces along with the relevant state-of-the-art technologies. In general, despite decades of research in information retrieval, NLP and machine learning (including the most recent progress in deep learning), there are still no “plug-in” solutions that can be directly applied to Articker.
Recently, significant progress has been made in text processing through word2vec approaches and word embeddings as in [10], [11] and [14]. Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. The term word2vec literally translates to word to vector. The most important feature of word embeddings is that words which are similar in a semantic sense have a smaller distance (Euclidean, cosine or other) between them than words that have no semantic relationship. For example, words like “painter” and “sculptor” should be closer together than the words “painter” and “basketball” or “sculptor” and “hip hop”. In word2vec, embeddings are learned by a shallow neural network with one input layer, one hidden layer and one output layer. Word embeddings and word2doc techniques are helpful in all major problems which Articker faces in building its media indexes.
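The notion of semantic closeness can be illustrated with cosine similarity. The three-dimensional vectors below are hand-made toy values, not trained embeddings; real word2vec vectors typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors chosen by hand for illustration only.
vectors = {
    "painter":    [0.9, 0.8, 0.1],
    "sculptor":   [0.8, 0.9, 0.2],
    "basketball": [0.1, 0.2, 0.9],
}

painter_sculptor = cosine_similarity(vectors["painter"], vectors["sculptor"])
painter_basketball = cosine_similarity(vectors["painter"], vectors["basketball"])
```

With trained embeddings the same comparison would be expected to place “painter” much closer to “sculptor” than to “basketball”, as it does for these toy values.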
Below we enumerate the major challenges we face.
Generally, we would like to match artist names with articles by differentiating between shallow and deep presence of an artist name in the text. The classical approach is of course TF-IDF as per [26]-[28]. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. TF-IDF is one of the most popular term-weighting schemes today. Thus, TF-IDF is a start for establishing the relevance of artist names in a given article. Much more information is needed including the article’s structure, context (word embedding) and even document classification.
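As a minimal sketch of the classical scheme, the function below computes TF-IDF for a single term over a tiny, hypothetical tokenized corpus. It uses the plain logarithmic inverse document frequency; many variants exist, and this is not necessarily the one used in Articker.

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """Term frequency in the document times log inverse document frequency."""
    tf = Counter(document)[term] / len(document)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    return tf * math.log(len(corpus) / docs_with_term)

# Hypothetical corpus of three tokenized articles.
feature = "banksy mural appears in london banksy confirms".split()
listing = "group show features banksy and many others in paris this spring".split()
other = "museum opens in berlin".split()
corpus = [feature, listing, other]

score_feature = tf_idf("banksy", feature, corpus)
score_listing = tf_idf("banksy", listing, corpus)
score_common = tf_idf("in", feature, corpus)
```

The name scores higher in the short article that repeats it than in the longer listing, while a word that occurs in every document scores zero. This is a useful first signal, but it says nothing about headlines, context or article type.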
The core question is whether the article is truly about the given artist or whether the artist’s name is mentioned only tangentially (a so-called shallow reference). This is related to the concept of deep relevance in the information retrieval literature [1] and [2]. Clearly, if the name of an artist is mentioned in the headline of an article, it is central to the article. But many articles feature an artist without mentioning the artist’s name in the headline. The title may refer to the artist only indirectly, through alternative descriptions like “leader of constructivism” or “emerging star of abstract painting”. In fact, the “central” name may not even appear in the first paragraph of an article. NLP techniques such as anaphora resolution [24], [25] are critical to establish the strength of name presence in the text. Word embeddings as in word2vec [14] as well as [10] and [11] are promising.
NER (Named-Entity Recognition) as in [3]-[7] is critical in Articker. However, today’s NER is highly insufficient for our purposes. Our goal is to continuously discover new artist names in the corpus of text, which grows by thousands of new articles daily. Unfortunately, tools like [6] and [7] are insufficient in both precision and recall. As is well known, pretty much any unigram, bigram or trigram can be a name. Names of artists can be particularly inventive, like “Mr.” or “American Artist”. Perfect recall could be achieved by considering any n-gram with n<4 as a possible artist name, but the precision would of course be totally unacceptable. Again, word embeddings as in word2vec [14] are needed.
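The recall versus precision trade-off is easy to see by enumerating all candidates. The sketch below generates every unigram, bigram and trigram of a made-up sentence; the whitespace tokenization is a simplifying assumption.

```python
def candidate_ngrams(tokens, max_n=3):
    """All contiguous n-grams with 1 <= n <= max_n, i.e. n < 4 by default."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Made-up sentence containing the artist name "American Artist".
sentence = "American Artist opens a solo show in New York".split()
candidates = candidate_ngrams(sentence)
```

A nine-word sentence already yields 24 candidates, of which only one is an artist name. Recall is perfect, but almost every candidate is a false positive, which is why linguistic context is needed to filter them.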
In general, the most difficult problem is to discover names of new artists, who have not yet been covered online. Since we also crawl gallery web sites as well as lists of MFA and BFA graduates of art schools, we may consider this information as an additional signal to determine if an n-gram is likely to be a name. Techniques to extract names from emails as in [4] can be relevant here as well. In general, it is the linguistic context which provides critical information to help distinguish a name from any n-gram.
Our data sources are largely limited to art media, but we also index broader media outlets, including global news outlets which are not focused on art. Thus, content classification plays a major role in Articker. This is particularly important in name disambiguation. There are some popular names, such as David Smith (the famous American sculptor), which are predominantly identified with artists. But for many popular names, from famous celebrities to athletes, overmatching can lead to irrelevant articles and false media index values. For example, Tadej Pogačar is the name of one of the top cyclists in the world but also of quite an accomplished artist, http://act.mit.edu/about/people/tadej-pogacar. Chris Martin is the name of the famous lead singer of Coldplay but also of an accomplished abstract painter, https://en.wikipedia.org/wiki/Chris_Martin_(artist). There are also many cases of multiple roles, for example a visual artist who is also a poet, filmmaker or even a politician (guess who this could be?). In such cases it is important to distinguish in which role an artist appears in a given article. Accounting for that role is very important since ignoring it can again lead to overmatching and inflation of media index values.
Content classification can be very useful to disambiguate and avoid false matching of artist names. Thus, classifying articles into categories such as sports, music, movies and visual arts is critical, especially for popular names. References [12] and [13] are particularly relevant here. We have had some great experiences using SVM for text classification. We have also developed our own techniques to help classify artists into art categories such as painters, sculptors, visual artists, performance artists, photographers etc. We have pending patents describing our approach.
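The idea of gating name matches by article category can be sketched as follows. The keyword lists are hypothetical and stand in for a trained classifier such as an SVM; they are not the classifier used in Articker.

```python
# Hypothetical keyword lists standing in for a trained text classifier.
CATEGORY_KEYWORDS = {
    "visual arts": {"painting", "gallery", "exhibition", "canvas", "sculpture"},
    "music": {"album", "tour", "band", "concert", "singer"},
    "sports": {"race", "stage", "team", "season", "cyclist"},
}

def classify(text):
    """Return the category with the most keyword hits, or None if there are none."""
    tokens = text.lower().split()
    scores = {
        category: sum(token in keywords for token in tokens)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def counts_toward_media_index(artist_name, text):
    """Accept a name match only when the article is classified as visual arts."""
    return artist_name.lower() in text.lower() and classify(text) == "visual arts"

art_article = "Chris Martin shows new painting at a Brooklyn gallery exhibition"
music_article = "Chris Martin and his band announce a concert tour and new album"

art_match = counts_toward_media_index("Chris Martin", art_article)
music_match = counts_toward_media_index("Chris Martin", music_article)
```

Both made-up articles contain the same name, but only the one classified as visual arts would contribute to the painter’s media index.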
As we mentioned earlier, name disambiguation is critical for Articker for two reasons: popular names and aliases. We have already discussed popular names. Aliases such as Mr., JR, American Artist or Futura are also a challenge. Our early approaches sometimes led to dramatic overmatching. Currently, we use techniques which are coupled with content classification, as well as the approaches in [17]-[23]. There is a huge body of work targeted at author name disambiguation for the purpose of correct attribution in citation indexes.
As we mentioned before, the media index of an artist depends heavily on the importance of the domain of the matching article. These domains have to be weighted. Certainly The New York Times should carry more weight in the media index of an artist than, say, e-artnow.org. This is the problem targeted by PageRank in [30] and the foundational success of Google as a search engine. The power or weight of the domain depends on other domains linking to it.
We are using a form of PageRank in our weighting of almost 16,000 domains, which we index in our Articker media index. We have established a core set of highly respected reference domains, which account for the domain weight. Links from core domains build the given domain’s weight. The more links from core domains point to the given domain, the higher the domain weight.
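One standard way to express this idea is personalized PageRank, in which the random surfer restarts only at the core domains. The sketch below is an assumption about the general shape of such a computation, not Articker’s actual formula; the domain names, link graph and damping factor are hypothetical.

```python
def core_weighted_pagerank(links, core, damping=0.85, iterations=50):
    """Power iteration for PageRank whose teleport step lands only on core domains."""
    nodes = set(links) | {t for targets in links.values() for t in targets} | set(core)
    teleport = {n: (1.0 / len(core) if n in core else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for source in nodes:
            targets = links.get(source, [])
            if targets:
                # Split the source's rank evenly among the domains it links to.
                passed_on = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += passed_on
            else:
                # Domain without outgoing links: return its rank to the core set.
                for n in nodes:
                    new_rank[n] += damping * rank[source] * teleport[n]
        rank = new_rank
    return rank

# Hypothetical link graph: domain -> domains it links to.
links = {
    "core-a.example": ["gallery-x.example", "blog-y.example"],
    "core-b.example": ["gallery-x.example"],
    "blog-y.example": ["blog-z.example"],
}
core = {"core-a.example", "core-b.example"}
weights = core_weighted_pagerank(links, core)
```

In this toy graph the domain linked from both core domains receives a higher weight than the domain linked from only one, and a domain reachable only through a non-core domain receives the least.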
1. Jiafeng Guo, Yixing Fan, Qingyao Ai, Bruce Croft [2016] “A Deep Relevance Matching Model for Ad-hoc Retrieval”
CIKM ’16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, October 2016 Pages 55–64
2. Frankie Patman, Paul Thompson [2003] “Names: A New Frontier in Text Mining”
International Conference on Intelligence and Security Informatics, ISI 2003: Intelligence and Security Informatics, pp. 27–38
3. Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J Bethard, and David McClosky [2014]. The stanford core nlp natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.
4. Einat Minkov, Richard C Wang, and William W Cohen [2005]. Extracting personal names from email: Applying named entity recognition to informal text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 443–450.
5. Erik F Sang and Fien De Meulder [2003]. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
6. SpaCy https://spacy.io/api/entityrecognizer
7. https://nlp.stanford.edu/software/CRF-NER.html (Stanford NER)
8. Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Jauvin, Christian [2003]. “A Neural Probabilistic Language Model” (PDF). Journal of Machine Learning Research. 3: 1137–1155.
9. Yoon Kim [2014] Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882
10. Mikolov, Tomas; et al. [2013]. “Efficient Estimation of Word Representations in Vector Space”. arXiv:1301.3781[cs.CL].
11. Mikolov, Tomas [2013]. “Distributed representations of words and phrases and their compositionality”. Advances in Neural Information Processing Systems. arXiv:1310.4546.
12. Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov [2016]. “Bag of Tricks for Efficient Text Classification”. arXiv:1607.01759 [cs.CL]
13. Joachims, Thorsten [1998]. “Text categorization with Support Vector Machines: Learning with many relevant features”. Machine Learning: ECML-98. Lecture Notes in Computer Science. Springer. 1398: 137–142. doi:10.1007/BFb0026683. ISBN 978-3-540-64417-0
14. Goldberg, Yoav; Levy, Omer [2014]. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method”. arXiv:1402.3722 [cs.CL].
15. Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Jauvin, Christian [2003]. “A Neural Probabilistic Language Model” (PDF). Journal of Machine Learning Research. 3: 1137–1155.
16. Yoon Kim [2014]. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
17. H. Han, L. Giles, H. Zha et al. [2004], “Two supervised learning approaches for name disambiguation in author citations,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, pp. 296–305, Ontario Canada, June 2004.
18. Y. Chen and J. H. Martin [2007], “Towards robust unsupervised personal name disambiguation,” Empirical Methods in Natural Language Processing, vol. 80, pp. 190–198, 2007.
19. W. Zhang [2019], Research on Disambiguation of Authors with the Same Name in Literature Database, Shandong University, Jinan, China, 2019.
20. Y. Zhang, F. Zhang, P. Yao et al. [2018], “Name disambiguation in AMiner: clustering, maintenance, and human in the loop,” Knowledge Discovery and Data Mining, vol. 18, pp. 1002–1011, 2018.
21. Jie Tang; A.C.M. Fong; Bo Wang; Jing Zhang [2012]. “A Unified Probabilistic Framework for Name Disambiguation in Digital Library”. IEEE Transactions on Knowledge and Data Engineering. IEEE. 24(6): 975–987.
22. Xuezhi Wang; Jie Tang; Hong Cheng; Philip S. Yu [2011]. ADANA: Active Name Disambiguation. Proceedings of 2011 IEEE International Conference on Data Mining. Vancouver: IEEE. pp. 794–803.
23. Zhang, Ziqi; Nuzzolese, Andrea; Gentile, Anna Lisa [2017]. Entity Deduplication on Scholarly Data. Proceedings the Extended Semantic Web Conference. Springer-Verlag. pp. 85–100
24. Shalom Lappin, Herbert J. Leass [1994] “An algorithm for pronominal anaphora resolution”. Computational Linguistics, Volume 20, Issue 4, December 1994, pp. 535–561
25. R Mitkov [2014]. “Anaphora Resolution”. Google Books.
26. Juan Ramos [2003]. Using TF-IDF to Determine Word Relevance in Document Queries
27. Salton, G.; Buckley, C. [1988]. “Term-weighting approaches in automatic text retrieval” (PDF). Information Processing & Management. 24 (5): 513–523.
28. Wu, H. C.; Luk, R.W.P.; Wong, K.F.; Kwok, K.L. [2008]. “Interpreting TF-IDF term weights as making relevance decisions”. ACM Transactions on Information Systems. 26 (3)
29. Articker [2021], “Finding promising artists with Articker” articker.org (papers)
30. Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry [1999] “The PageRank Citation Ranking: Bringing Order to the Web” Technical Report. Stanford InfoLab.