VISUALIZING TEXT. Petra Isenberg

RECAP STRUCTURED DATA UNSTRUCTURED DATA

(TODAY) VISUALIZING TEXT

TEXT IS DIFFERENT
COMMON
UNSTRUCTURED (MOSTLY)
HIGH-DIMENSIONAL (10,000+)
BIG!

WHY VISUALIZE TEXT?

WHY
To assist information retrieval
To enable linguistic analysis
To augment analytics on mixed data
Themescape, Visual Thesaurus, Thread Arcs

WHY VISUALIZE TEXT?
UNDERSTANDING: GET THE GIST OF A DOCUMENT
GROUPING: CLUSTER FOR OVERVIEW OR CLASSIFICATION
COMPARE: COMPARE DOCUMENT COLLECTIONS, OR INSPECT EVOLUTION OF COLLECTION OVER TIME
CORRELATE: COMPARE PATTERNS IN TEXT TO THOSE IN OTHER DATA, E.G., CORRELATE WITH SOCIAL NETWORK

WHAT IS TEXT DATA?
DOCUMENTS
  ARTICLES, BOOKS AND NOVELS
  COMPUTER PROGRAMS
  E-MAILS, WEB PAGES, BLOGS
  TAGS, COMMENTS
COLLECTION OF DOCUMENTS
  MESSAGES (E-MAIL, BLOGS, TAGS, COMMENTS)
  SOCIAL NETWORKS (PERSONAL PROFILES)
  ACADEMIC COLLABORATIONS (PUBLICATIONS)
  EVEN WHOLE LIBRARIES, WEBSITES, SOCIAL NETWORKS

DIFFICULT DATA
Too much data: what to use?
  Millions of blog posts, hundreds of thousands of news stories, 183 billion emails, ... per day
Data is noisy:
  70-72% of email is spam
  Text contains section headings, figure captions, and direct quotes.

ONCE YOU HAVE THE DATA...
Most meaning comes from our minds and common understanding.
"How much is that doggy in the window?"
  "how much": social system of barter and trade (not the size of the dog)
  "doggy": implies childlike, plaintive, probably cannot do the purchasing on their own
  "in the window": implies behind a store window, not really inside a window; requires notion of window shopping
(Hearst, 2006)

LANGUAGE IS AMBIGUOUS
Words and phrases can have many meanings, determined by context and world knowledge.
Interesting language is often figurative:
  "Tables encourage casual interaction." vs. "I encouraged her to take a day off."

LANGUAGE IS AMBIGUOUS
I saw Pathfinder on Mars with a telescope.
Pathfinder photographed Mars.
The Pathfinder photograph mars our perception of a lifeless planet.
The Pathfinder photograph from Ford has arrived.
The Pathfinder forded the river without marring its paint job.
(Hearst, 2006)

VISUAL CONSIDERATIONS
TEXT IS NOT PREATTENTIVE
Supporters of Martin, who has been jailed without trial for more than two years, are calling on Prime Minister Stephen Harper to ask Mexican president Felipe Calderon to release Martin under a section of the Mexican constitution that allows the government to expel undesirables from the country. Martin's supporters believe she has no chance of a fair trial in Mexico. Neither does Waage.

VISUAL CONSIDERATIONS
Text readability is dependent on size, orientation, font, clutter, ...
Language visualization is more likely to need large amounts of text.

VISUALIZING LANGUAGE IS ALSO EASY!
So much data available for analysis
(Mostly) readily computer readable
Simple techniques can give instant summaries

OUTLINE
TEXT AS DATA
VISUALIZING DOCUMENT CONTENT
EVOLVING DOCUMENTS
DOCUMENT COLLECTIONS

TEXT AS DATA

Words are the basic unit of data.

WORD-LEVEL ATTRIBUTES
WORD LENGTH
PART OF SPEECH (NOUN, VERB, ADJECTIVE, ETC.)
FORMAT (ITALIC, UNDERLINE, ETC.)
LANGUAGE (ENGLISH? LATIN? JAPANESE?)
FREQUENCY / DIFFICULTY (IS IT COMMON?)
SENTIMENT (POSITIVE OR NEGATIVE CONNOTATION)
SYNONYMS / ANTONYMS / ETYMOLOGY (OTHER MEANINGS? ROOTS?)
ENTITIES (E.G., Calgary, Obama, Telus)
AND MANY MORE

AGGREGATION
REPETITION, PLAGIARISM, SHARED ENTITIES, AUTHOR STYLE
COLLECTION > DOCUMENT > SECTION > PAGE > PARAGRAPH > SENTENCE > WORD
TENSE, SENTIMENT, SENTENCE LENGTH, READING

LINGUISTIC METHODS
Word Counting
Word Scoring
Stemming
Stop Word Removal
Part of Speech Tagging
Parsing
Word Sense Disambiguation
Named Entity Recognition
Semantic Categorization
Sentiment Analysis
Topic Modeling
(some caveats)

WHAT ABOUT THESE WORDS?
automate, automates, automatic, automation, automat
a, an, the, to
New York
Ban Ki-moon
Manchester Unitd United

STEMMING
Reduce words to their stems by removing endings (morphology)
  running -> run
  runs -> run
A good way to increase signal and reduce fracturing of the corpus if there aren't many words.
Note: Keep the original words somewhere! Also keep the case if you choose to lowercase the word; you never know when you'll need this data.
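
A minimal sketch of the advice above, assuming a toy suffix-stripping rule. This is not the Porter stemmer and not NLTK's implementation; `toy_stem` and `stem_with_originals` are hypothetical names chosen for illustration. The point is that the original surface forms, including their case, are kept next to each stem.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Toy stemmer for illustration only: handles '-ing' and plural '-s'."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]
        # Undo consonant doubling left behind by '-ing' (runn -> run).
        if len(w) > 2 and w[-1] == w[-2]:
            w = w[:-1]
    elif w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    return w


def stem_with_originals(tokens):
    """Group tokens by stem while keeping every original form and its count."""
    originals = defaultdict(Counter)
    for tok in tokens:
        originals[toy_stem(tok)][tok] += 1
    return originals


forms = stem_with_originals(["Running", "runs", "run", "running", "runs"])
print(dict(forms))
```

Counting happens per stem, but nothing is thrown away: the capitalized "Running" is still available if a later analysis needs it.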

STOP WORD REMOVAL
Common words such as "and", "the", "I" are removed from view to highlight content words
Domain-specific stop words, e.g., in the legal domain: court, attorney, honour, plaintiff, etc.
Caution! These words have been shown to be useful for stylistic analysis!
When working with text corpora, KEEP EVERYTHING.
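
One way to remove stop words "from view" while still keeping everything is to split the token stream instead of filtering it. The sketch below assumes a tiny hand-written stop list; `STOP_WORDS` and `split_stop_words` are hypothetical names, and a real list would be much longer.

```python
# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "to", "be", "i"}


def split_stop_words(tokens, stop_words=STOP_WORDS):
    """Return (content_words, stop_words_found) without discarding anything."""
    content, stops = [], []
    for tok in tokens:
        (stops if tok.lower() in stop_words else content).append(tok)
    return content, stops


tokens = "I went to the court and the attorney".split()
content, stops = split_stop_words(tokens)

# Domain-specific stop words can be added for one call only.
legal_content, _ = split_stop_words(tokens, STOP_WORDS | {"court", "attorney"})
print(content, stops, legal_content)
```

The `stops` list stays available for stylistic analysis, which is the reason for the caution on the slide.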

NAMED ENTITY RECOGNITION
What are the people and places in the text?
Use NLTK; it's very good at this.
http://vialab.science.uoit.ca/docuburst

TEXT PROCESSING PIPELINE
TOKENIZATION: SEGMENT TEXT INTO TERMS
  ENTITIES? SAN FRANCISCO, O'CONNOR, U.S.A.
  REMOVE STOP WORDS? A, AN, THE, TO, BE
  N-GRAMS? CAN TAKE WORDS IN 2-WORD GROUPS (BI-GRAMS), 3-WORD (TRI-GRAMS), ETC.
STEMMING: GROUP TOGETHER DIFFERENT FORMS
  ROOTS: VISUALIZATION(S), VISUALIZE(S), VISUALLY -> VISUAL
  LEMMATIZATION: GOES, WENT, GONE -> GO
FOR VISUALIZATION, SOMETIMES NEED TO REVERSE STEMMING FOR LABELS
  SIMPLE SOLUTION: MAP FROM STEM TO THE MOST FREQUENT WORD
RESULT: ORDERED STREAM OF TERMS
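
The "simple solution" for labels can be sketched as follows. The stem lookup table here is a hypothetical stand-in for a real stemmer, and `labels_for_stems` is an illustrative name, not part of any library.

```python
from collections import Counter, defaultdict


def labels_for_stems(tokens, stem):
    """Map each stem to its most frequent surface form, for use as a display label."""
    forms = defaultdict(Counter)
    for tok in tokens:
        forms[stem(tok)][tok] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in forms.items()}


# Hypothetical stem table standing in for a real stemmer.
HYPOTHETICAL_STEMS = {
    "visualization": "visual",
    "visualizations": "visual",
    "visualize": "visual",
    "visually": "visual",
}

tokens = ["visualization", "visualize", "visualization", "visually", "visualizations"]
labels = labels_for_stems(tokens, lambda t: HYPOTHETICAL_STEMS.get(t, t))
print(labels)
```

A readable word then appears in the visualization instead of a truncated stem.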

TEXT PROCESSING PIPELINE
The quick brown fox jumps over the lazy dog.
TOKENIZE (N=1)
  [The], [quick], [brown], [fox], [jumps], [over], [the], [lazy], [dog]
TOKENIZE (N=1), REMOVE STOPWORDS, STEM
  [quick], [brown], [fox], [jump], [over], [lazy], [dog]
TOKENIZE (N=2)
  [the quick], [quick brown], [brown fox], [fox jumps], [jumps over], [over the]
TOKENIZE (N=5)
  [the quick brown fox jumps], [quick brown fox jumps over], [brown fox jumps over
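
The tokenization and n-gram steps of this example can be reproduced with a short sketch. It assumes a simple regular-expression tokenizer that lowercases everything and drops punctuation; `tokenize` and `ngrams` are illustrative names, and real tokenizers (for example in NLTK) handle many more cases.

```python
import re


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Slide a window of n tokens over the ordered token stream."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The quick brown fox jumps over the lazy dog.")
bigrams = ngrams(tokens, 2)
five_grams = ngrams(tokens, 5)
print(tokens)
print(bigrams)
print(five_grams)
```

Because the window slides one token at a time, a sentence of 9 tokens yields 8 bigrams and 5 five-grams.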

NLTK (NATURAL LANGUAGE TOOLKIT) NLTK.org Python

VISUALIZING DOCUMENT CONTENT

BUT FIRST SOME SKETCHING

SKETCHING: VISUALIZE DISSERTATIONS
IMAGINE YOU HAVE 20 YEARS OF UNIVERSITY PH.D. THESES:
  YEAR, DEPARTMENT, AUTHOR, ADVISOR, COMMITTEE, COMPLETE TEXT
TASK:
  1) VISUALIZE THE MOST IMPORTANT CONTENT FROM A SINGLE THESIS.
  2) VISUALIZE HOW SIMILAR THESES FROM EACH DEPARTMENT ARE TO THESES FROM OTHER DEPARTMENTS.
GROUPS OF 3 (~10 MINUTES)

EXAMPLE THESIS WESLEY WILLETT

TAG CLOUDS WORD COUNT http://tagcrowd.com/ THESIS WESLEY WILLETT

TAG CLOUDS WORD COUNT www.jasondavies.com/wordcloud/ THESIS WESLEY WILLETT

WHAT PROBLEMS DO YOU SEE WITH TAG CLOUDS?

TAG CLOUDS
STRENGTHS
  CAN HELP WITH GISTING AND INITIAL QUERY FORMATION.
WEAKNESSES
  SUB-OPTIMAL VISUAL ENCODING (SIZE VS. POSITION)
  INACCURATE SIZE ENCODING (LONG WORDS ARE BIGGER)
  MAY NOT FACILITATE COMPARISON (UNSTABLE LAYOUT)
  ORDER USUALLY MEANINGLESS (USUALLY ALPHABETICAL OR RANDOM)
  TERM FREQUENCY MAY NOT BE MEANINGFUL
  DOES NOT SHOW THE STRUCTURE OF THE TEXT
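
The model behind a tag cloud is just a word count mapped to a font size. The sketch below assumes a linear mapping between hypothetical minimum and maximum point sizes; `font_sizes`, `min_pt`, and `max_pt` are illustrative names, and real tag cloud tools may scale differently.

```python
from collections import Counter


def font_sizes(counts, min_pt=10.0, max_pt=48.0):
    """Linearly map word counts onto a font-size range (one possible encoding)."""
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {w: min_pt + (c - lo) / span * (max_pt - min_pt) for w, c in counts.items()}


counts = Counter("cats eat mice cats eat kibble cats are awesome".split())
sizes = font_sizes(counts)
print(sizes)
```

Note that "awesome" and "are" get the same font size here but occupy different amounts of space on screen, which is the size-encoding weakness listed above.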

WORD COUNTS

WORDCOUNT http://wordcount.org JONATHAN HARRIS

CONCORDANCE WHAT IS THE COMMON LOCAL CONTEXT OF A TERM?
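
A concordance can be sketched as a keyword-in-context listing: every occurrence of a term with a few tokens on either side. The function name `concordance` and the `width` parameter are illustrative choices, and whitespace tokenization is a simplifying assumption.

```python
def concordance(tokens, term, width=3):
    """Keyword-in-context: each hit with up to `width` tokens on either side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


tokens = "cats eat mice and cats are better than dogs".split()
hits = concordance(tokens, "cats", width=2)
for left, keyword, right in hits:
    print(f"{left:>12} | {keyword} | {right}")
```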

WORD TREES
cats are better than dogs
cats eat kibble
cats are better than hamsters
cats are awesome
cats are people too
cats eat mice
cats meowing
cats in the cradle
cats eat mice
cats in the cradle lyrics
cats eat kibble
cats for adoption
cats are family
cats eat mice
cats are better than kittens
cats are evil
cats are weird
cats eat mice
WATTENBERG & VIÉGAS 2008
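
The data structure behind a word tree can be sketched as a prefix tree with counts. This sketch assumes every phrase starts with the root word, which is simpler than a full word tree built from all occurrences of a term in running text; `build_word_tree` is an illustrative name.

```python
def build_word_tree(phrases, root):
    """Prefix tree of the words following `root`; each node counts the phrases through it."""
    tree = {"count": 0, "children": {}}
    for phrase in phrases:
        words = phrase.split()
        if not words or words[0] != root:
            continue  # this sketch only handles phrases that start with the root
        node = tree
        node["count"] += 1
        for w in words[1:]:
            node = node["children"].setdefault(w, {"count": 0, "children": {}})
            node["count"] += 1
    return tree


phrases = [
    "cats are better than dogs",
    "cats eat kibble",
    "cats are better than hamsters",
    "cats are awesome",
    "cats eat mice",
]
tree = build_word_tree(phrases, "cats")
print(tree["children"]["are"]["count"], tree["children"]["eat"]["count"])
```

In the visualization, a node's count would typically drive the font size of its word, and shared prefixes such as "are better than" are drawn once.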

FILTER INFREQUENT RUNS

WORDSEER MURALIDHARAN & HEARST

RECURRENT THEMES IN SPEECH

GLIMPSES OF STRUCTURE
CONCORDANCES SHOW LOCAL, REPEATED STRUCTURE
BUT WHAT ABOUT OTHER TYPES OF PATTERNS? FOR EXAMPLE
  LEXICAL: <A> at <B>
  SYNTACTIC: <Noun> <Verb> <Object>

PHRASE NETS
LOOK FOR SPECIFIC LINKING PATTERNS IN THE TEXT: A AND B, A AT B, A OF B, ETC.
  COULD BE OUTPUT OF REGEXP OR PARSER
VISUALIZE EXTRACTED PATTERNS IN A NODE-LINK VIEW
  OCCURRENCES = NODE SIZE
  PATTERN POSITION = EDGE DIRECTION
van Ham et al.
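
The regular-expression route can be sketched as below. This is an illustration of the idea, not the Phrase Nets implementation; `phrase_net_edges` and `node_sizes` are hypothetical names. A lookahead is used so that chains such as "A and B and C" yield both (A, B) and (B, C).

```python
import re
from collections import Counter


def phrase_net_edges(text, linker="and"):
    """Count directed (X, Y) pairs for the pattern 'X <linker> Y'."""
    pattern = re.compile(
        r"\b(\w+)\s+" + re.escape(linker) + r"\s+(?=(\w+)\b)", re.IGNORECASE
    )
    return Counter((a.lower(), b.lower()) for a, b in pattern.findall(text))


def node_sizes(edges):
    """Occurrences per word across all edges (node size in the node-link view)."""
    sizes = Counter()
    for (a, b), n in edges.items():
        sizes[a] += n
        sizes[b] += n
    return sizes


text = "salt and pepper, bread and butter, Salt and pepper and vinegar"
edges = phrase_net_edges(text)
sizes = node_sizes(edges)
print(edges)
print(sizes)
```

Each key is ordered (first word, second word), which gives the edge direction; the edge count and node totals can then be handed to any node-link layout.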

X and Y PORTRAIT OF THE ARTIST AS A YOUNG MAN JAMES JOYCE

NODE GROUPING

THE BIBLE {X} begat {Y}

18TH & 19TH CENTURY NOVELS {X}'s {Y}

OLD TESTAMENT {X} of {Y}

X of Y NEW TESTAMENT {X} of {Y}

RHYME, SPEECH, ETC. POEMAGE McCurdy et al. 2016

REVISIT YOUR SKETCHES?
TASK: 1) VISUALIZE THE MOST IMPORTANT CONTENT FROM A SINGLE THESIS.
ARE YOUR VISUALIZATION CHOICES EFFECTIVE?
DOES THE VIS CAPTURE THE LENGTH, FORM, AND POSITION OF THE IMPORTANT CONTENT?
DO YOU SHOW OR CONNECT BACK TO THE ORIGINAL TEXT?

EVOLVING DOCUMENTS

VISUALIZING REVISION HISTORY HOW TO DEPICT CONTRIBUTIONS AND CHANGES OVER TIME?

DIFF
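
A word-level diff between two revisions can be sketched with Python's standard `difflib` module. The `word_diff` helper and its "keep" / "delete" / "insert" labels are illustrative choices; this shows the underlying comparison only, not how any particular tool draws it.

```python
import difflib


def word_diff(old, new):
    """Return (operation, words) pairs describing how `old` became `new`."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("keep", a[i1:i2]))
        else:
            # 'replace' is reported as a delete followed by an insert.
            if i1 != i2:
                ops.append(("delete", a[i1:i2]))
            if j1 != j2:
                ops.append(("insert", b[j1:j2]))
    return ops


ops = word_diff("text is hard to visualize", "text is easy to visualize well")
for op, words in ops:
    print(op, " ".join(words))
```

Running the same comparison over successive revisions gives the per-revision insertions and deletions that a revision-history visualization can then lay out over time.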

WIKIPEDIA HISTORY FLOW VIÉGAS ET AL 2004

ANIMATED TRACES http://fathom.info/traces BEN FRY

REVISION HISTORY Visualizing traces http://benfry.com/traces/

DIFFAMATION

VISUALIZING DOCUMENT COLLECTIONS

newsmap.jp

DOCUMENT CARDS SMALL MULTIPLES FOR DOCUMENTS

THEMERIVER HAVRE ET AL 1999

PARALLEL TAG CLOUDS COLLINS ET AL. 09

SUPPORTING SEARCH TileBars Hearst 1999

SeeSoft Eick 199

NAMED ENTITY RECOGNITION
IDENTIFY AND CLASSIFY NAMED ENTITIES IN TEXT:
  JOHN SMITH IS A PERSON
  SOVIET UNION IS A COUNTRY
  2500 UNIVERSITY DR IS AN ADDRESS
  (555) 867-5309 IS A PHONE NUMBER
ENTITY RELATIONS: HOW DO THE ENTITIES RELATE?
  DO THEY CO-OCCUR IN A DOCUMENT? IN A SENTENCE?
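
Sentence-level co-occurrence can be sketched once the entities are known. The sketch assumes the entity list has already been produced by some recognizer (here it is simply hard-coded) and uses a naive sentence splitter; `cooccurrences` is an illustrative name.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrences(text, entities):
    """Count how often each pair of known entities appears in the same sentence."""
    pairs = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found = sorted(e for e in entities if e in sentence)
        pairs.update(combinations(found, 2))
    return pairs


text = (
    "John Smith visited the Soviet Union. "
    "John Smith called home. "
    "The Soviet Union hosted John Smith."
)
pairs = cooccurrences(text, ["John Smith", "Soviet Union"])
print(pairs)
```

The resulting pair counts are the edge weights of an entity network; switching from sentences to whole documents only changes the unit that is split on.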

JIGSAW

CENDARI NOTE-TAKING ENVIRONMENT 2015

DOCUMENT SIMILARITY & CLUSTERING
COMPUTE SIMILARITY BETWEEN DOCUMENTS BASED ON THE WORDS THEY SHARE
  TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY) IS COMMON
TOPIC MODELING APPROACHES
  ASSUME DOCUMENTS ARE A MIXTURE OF TOPICS
  TOPICS ARE (ROUGHLY) A SET OF CO-OCCURRING TERMS
  LATENT SEMANTIC ANALYSIS (LSA): REDUCE TERM MATRIX
MANY, MANY APPROACHES EXIST
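
TF-IDF similarity can be sketched in plain Python. This assumes one common weighting variant, tf = count / document length and idf = log(N / document frequency), combined with cosine similarity; other variants exist, and `tfidf_vectors` and `cosine` are illustrative names.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """One sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


docs = ["cats eat mice", "cats eat kibble", "dogs chase cars"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The pairwise similarities form a matrix that can feed clustering or a 2D layout; terms that appear in every document get weight zero, so they do not contribute to similarity.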

STANFORD DISSERTATION BROWSER CHUANG, RAMAGE, MANNING & HEER 2012

STANFORD DISSERTATION BROWSER

WARNING
OFTEN, TEXT VISUALIZATIONS DO NOT REPRESENT TEXT DIRECTLY, BUT THEY REPRESENT A MODEL
  WORD COUNTS, WORD SEQUENCES, CLUSTERS, ETC.
ASK:
  CAN YOU INTERPRET THE VISUALIZATION?
  DOES THE MODEL ACCURATELY REPRESENT THE ORIGINAL TEXT?

LESSONS FOR TEXT VISUALIZATION
SHOW SOURCE TEXT (OR PROVIDE ACCESS TO IT)
  WHERE POSSIBLE, USE VISUALIZATION AS INDEX INTO DOCUMENTS
GROUP DOCUMENTS IN MEANINGFUL WAYS
  WILL VIEWERS UNDERSTAND THE CLUSTERS?
WHERE POSSIBLE USE TEXT TO REPRESENT TEXT

HUNDREDS OF TOOLS & TECHNIQUES FOR TEXT AT http://textvis.lnu.se/

QUESTIONS?

ACKNOWLEDGEMENTS
Slides were inspired by, adapted from, or taken from slides by
  Christopher Collins (University of Ontario Institute of Technology)
  Wesley Willett (University of Calgary)