VISUALIZING TEXT. Petra Isenberg


RECAP STRUCTURED DATA UNSTRUCTURED DATA

(TODAY) VISUALIZING TEXT

TEXT IS DIFFERENT
COMMON
UNSTRUCTURED (MOSTLY)
HIGH-DIMENSIONAL (10,000+)
BIG!

WHY VISUALIZE TEXT?

WHY
To assist information retrieval
To enable linguistic analysis
To augment analytics on mixed data
EXAMPLES: Themescape, Visual Thesaurus, Thread Arcs

WHY VISUALIZE TEXT?
UNDERSTANDING: GET THE GIST OF A DOCUMENT
GROUPING: CLUSTER FOR OVERVIEW OR CLASSIFICATION
COMPARE: COMPARE DOCUMENT COLLECTIONS, OR INSPECT EVOLUTION OF COLLECTION OVER TIME
CORRELATE: COMPARE PATTERNS IN TEXT TO THOSE IN OTHER DATA, E.G., CORRELATE WITH SOCIAL NETWORK

WHAT IS TEXT DATA?
DOCUMENTS
  ARTICLES, BOOKS AND NOVELS
  COMPUTER PROGRAMS
  E-MAILS, WEB PAGES, BLOGS
  TAGS, COMMENTS
COLLECTION OF DOCUMENTS
  MESSAGES (E-MAIL, BLOGS, TAGS, COMMENTS)
  SOCIAL NETWORKS (PERSONAL PROFILES)
  ACADEMIC COLLABORATIONS (PUBLICATIONS)
EVEN WHOLE LIBRARIES, WEBSITES, SOCIAL NETWORKS

DIFFICULT DATA
TOO MUCH DATA
  Millions of blog posts, hundreds of thousands of news stories, 183 billion emails, ... per day
NOISY DATA
  70-72% of email is spam
  Text contains section headings, figure captions, and direct quotes.

ONCE YOU HAVE THE DATA...
Most meaning comes from our minds and common understanding.
"How much is that doggy in the window?"
  "how much": social system of barter and trade (not the size of the dog)
  "doggy": implies childlike, plaintive, probably cannot do the purchasing on their own
  "in the window": implies behind a store window, not really inside a window; requires notion of window shopping
(Hearst, 2006)

LANGUAGE IS AMBIGUOUS
Words and phrases can have many meanings, determined by context and world knowledge.
Interesting language is often figurative:
  You are a couch potato.
  They fought like cats and dogs.
  Opportunity knocked on the door.

VISUAL CONSIDERATIONS
TEXT IS NOT PREATTENTIVE
Supporters of Martin, who has been jailed without trial for more than two years, are calling on Prime Minister Stephen Harper to ask Mexican president Felipe Calderon to release Martin under a section of the Mexican constitution that allows the government to expel undesirables from the country. Martin's supporters believe she has no chance of a fair trial in Mexico. Neither does Waage.


VISUALIZING LANGUAGE IS ALSO EASY!
So much data available for analysis
(Mostly) readily computer readable
Simple techniques can give instant summaries

OUTLINE
TEXT AS DATA
VISUALIZING DOCUMENT CONTENT
EVOLVING DOCUMENTS
DOCUMENT COLLECTIONS

TEXT AS DATA

Words are the basic unit of data.

WORD-LEVEL ATTRIBUTES
WORD LENGTH
PART OF SPEECH (NOUN, VERB, ADJECTIVE, ETC.)
FORMAT (ITALIC, UNDERLINE, ETC.)
LANGUAGE (ENGLISH? LATIN? JAPANESE?)
FREQUENCY / DIFFICULTY (IS IT COMMON?)
SENTIMENT (POSITIVE OR NEGATIVE CONNOTATION)
SYNONYMS / ANTONYMS / ETYMOLOGY (OTHER MEANINGS? ROOTS?)
ENTITIES (E.G., Calgary, Obama, Telus)
AND MANY MORE

AGGREGATION
REPETITION, PLAGIARISM, SHARED ENTITIES, AUTHOR STYLE
COLLECTION > DOCUMENT > SECTION > PAGE > PARAGRAPH > SENTENCE > WORD
TENSE, SENTIMENT, SENTENCE LENGTH, READING LEVEL

LINGUISTIC METHODS
Word Counting
Word Scoring
Stemming
Stop Word Removal
Part of Speech Tagging
Parsing
Word Sense Disambiguation
Named Entity Recognition
Semantic Categorization
Sentiment Analysis
Topic Modeling
(some caveats)

NAMED ENTITY RECOGNITION
IDENTIFY AND CLASSIFY NAMED ENTITIES IN TEXT:
  JOHN SMITH IS A PERSON
  SOVIET UNION IS A COUNTRY
  2500 UNIVERSITY DR IS AN ADDRESS
  (555) 867-5309 IS A PHONE NUMBER
ENTITY RELATIONS: HOW DO THE ENTITIES RELATE?
  DO THEY CO-OCCUR IN A DOCUMENT? IN A SENTENCE?
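The entity-relation question on this slide (do entities co-occur in a sentence?) can be sketched in a few lines of Python. This is only an illustration: the entity list is hard-coded and hypothetical, standing in for the output of a real named entity recognizer, and the sentence splitter is deliberately naive.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity list; a real system would get these from an NER model.
ENTITIES = ["John Smith", "Soviet Union", "Calgary"]

def sentence_cooccurrence(text, entities=ENTITIES):
    """Count how often each pair of entities appears in the same sentence."""
    pairs = Counter()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = sorted(e for e in entities if e in sentence)
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return pairs

sample = ("John Smith visited the Soviet Union. "
          "John Smith later moved to Calgary. "
          "The Soviet Union is far from Calgary, John Smith said.")
cooc = sentence_cooccurrence(sample)
```

The resulting pair counts could serve as edge weights in a node-link view of entities.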

TEXT PROCESSING PIPELINE
TOKENIZATION: SEGMENT TEXT INTO TERMS
  ENTITIES? SAN FRANCISCO, O'CONNOR, U.S.A.
  REMOVE STOP WORDS? A, AN, THE, TO, BE
  N-GRAMS? CAN TAKE WORDS IN 2-WORD GROUPS (BI-GRAMS), 3-WORD GROUPS (TRI-GRAMS), ETC.
STEMMING: GROUP TOGETHER DIFFERENT FORMS
  ROOTS: VISUALIZATION(S), VISUALIZE(S), VISUALLY -> VISUAL
  LEMMATIZATION: GOES, WENT, GONE -> GO
FOR VISUALIZATION, SOMETIMES NEED TO REVERSE STEMMING FOR LABELS
  SIMPLE SOLUTION: MAP FROM STEM TO THE MOST FREQUENT WORD
RESULT: ORDERED STREAM OF TERMS
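The "simple solution" for reversing stemming can be sketched as follows. The `crude_stem` function is a toy suffix stripper invented for this example, not a real stemming algorithm such as Porter's; only the mapping from each stem to its most frequent surface form reflects the idea on the slide.

```python
from collections import Counter, defaultdict

def crude_stem(word):
    """Toy stemmer (an assumption for illustration): strip a few suffixes."""
    for suffix in ("izations", "ization", "izes", "ize", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_labels(words):
    """Map each stem to the most frequent original word that produced it."""
    forms = defaultdict(Counter)
    for w in words:
        w = w.lower()
        forms[crude_stem(w)][w] += 1
    return {stem: counts.most_common(1)[0][0] for stem, counts in forms.items()}

words = ["visualization", "visualize", "visualizations", "visualization",
         "visually", "dogs", "dog", "dogs"]
labels = stem_labels(words)
```

Here the stem "visual" would be labeled "visualization" in a display, because that form occurs most often in the input.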

TEXT PROCESSING PIPELINE
The quick brown fox jumps over the lazy dog.
TOKENIZE (N=1): [The], [quick], [brown], [fox], [jumps], [over], [the], [lazy], [dog]
TOKENIZE (N=1), REMOVE STOPWORDS, STEM: [quick], [brown], [fox], [jump], [over], [lazy], [dog]
TOKENIZE (N=2): [the quick], [quick brown], [brown fox], [fox jumps], [jumps over], [over the]
TOKENIZE (N=5): [the quick brown fox jumps], [quick brown fox jumps over], [brown fox jumps over
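The tokenization steps in this example can be reproduced with plain Python. This is a minimal sketch: the regular expression tokenizer and the helper names are assumptions made for illustration, the stop word set is the short list from the pipeline slide, and stemming is left out.

```python
import re

STOP_WORDS = {"a", "an", "the", "to", "be"}  # stop words listed in the slides

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "The quick brown fox jumps over the lazy dog."
tokens = tokenize(sentence)
content = remove_stop_words(tokens)
bigrams = ngrams(tokens, 2)
fivegrams = ngrams(tokens, 5)
```

A library such as NLTK provides more robust tokenizers and stemmers; the sketch only shows the shape of the data at each step.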

NLTK (NATURAL LANGUAGE TOOLKIT) NLTK.org Python

VISUALIZING DOCUMENT CONTENT

TAG CLOUDS WORD COUNT http://tagcrowd.com/ THESIS WESLEY WILLETT

TAG CLOUDS WORD COUNT www.jasondavies.com/wordcloud/ THESIS WESLEY WILLETT

WHAT PROBLEMS DO YOU SEE WITH TAG CLOUDS?

TAG CLOUDS
STRENGTHS
  CAN HELP WITH GISTING AND INITIAL QUERY FORMATION
WEAKNESSES
  SUB-OPTIMAL VISUAL ENCODING (SIZE VS. POSITION)
  INACCURATE SIZE ENCODING (LONG WORDS ARE BIGGER)
  MAY NOT FACILITATE COMPARISON (UNSTABLE LAYOUT)
  ORDER USUALLY MEANINGLESS (USUALLY ALPHABETICAL OR RANDOM)
  TERM FREQUENCY MAY NOT BE MEANINGFUL
  DOES NOT SHOW THE STRUCTURE OF THE TEXT
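The word-count encoding behind a tag cloud can be sketched as a mapping from term frequency to font size. The linear scaling and the point-size range below are assumptions for illustration; actual tag cloud tools may scale differently.

```python
from collections import Counter

def font_sizes(tokens, min_pt=10, max_pt=40):
    """Map each word's count linearly onto a font size in points."""
    counts = Counter(tokens)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {w: min_pt + (c - lo) * (max_pt - min_pt) / span
            for w, c in counts.items()}

tokens = "text data text visualization text data".split()
sizes = font_sizes(tokens)
```

Even at the smallest size, "visualization" has more than three times as many characters as "text", which is one way the size encoding becomes inaccurate for long words.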

WORD COUNTS

WORDCOUNT http://wordcount.org JONATHAN HARRIS

CONCORDANCE WHAT IS THE COMMON LOCAL CONTEXT OF A TERM?
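A concordance lists every occurrence of a term together with its local context. A minimal keyword-in-context sketch, assuming the text is already tokenized and using a hypothetical `window` parameter for the number of context words on each side:

```python
def concordance(tokens, term, window=2):
    """Return (left context, term, right context) for each occurrence of term."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows

tokens = "cats eat mice and dogs chase cats that eat kibble".split()
rows = concordance(tokens, "eat")
```

Aligning the rows on the middle column gives the familiar concordance layout.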

WORD TREES
WATTENBERG & VIÉGAS 2008
cats are better than dogs / cats eat kibble / cats are better than hamsters / cats are awesome / cats are people too / cats eat mice / cats meowing / cats in the cradle / cats eat mice / cats in the cradle lyrics / cats eat kibble / cats for adoption / cats are family / cats eat mice / cats are better than kittens / cats are evil / cats are weird / cats eat mice

FILTER INFREQUENT RUNS
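A word tree merges phrases that share a prefix, and infrequent branches can then be filtered out. The nested-dictionary representation and the `min_count` threshold below are assumptions for illustration, not the data structure used by Wattenberg and Viégas.

```python
def build_word_tree(phrases):
    """Build a prefix tree in which each node counts the phrases passing through it."""
    root = {"count": 0, "children": {}}
    for phrase in phrases:
        node = root
        node["count"] += 1
        for word in phrase.split():
            node = node["children"].setdefault(word, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def prune(node, min_count):
    """Remove branches seen fewer than min_count times (modifies the tree in place)."""
    node["children"] = {w: prune(child, min_count)
                        for w, child in node["children"].items()
                        if child["count"] >= min_count}
    return node

phrases = ["cats are better than dogs", "cats eat kibble", "cats are awesome",
           "cats eat mice", "cats eat mice"]
tree = build_word_tree(phrases)
pruned = prune(build_word_tree(phrases), min_count=2)
```

In a rendering, each node's count would typically drive the font size of the word at that branch.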

WORDSEER MURALIDHARAN & HEARST

RECURRENT THEMES IN SPEECH

GLIMPSES OF STRUCTURE
CONCORDANCES SHOW LOCAL, REPEATED STRUCTURE
BUT WHAT ABOUT OTHER TYPES OF PATTERNS? FOR EXAMPLE
  LEXICAL: <A> at <B>
  SYNTACTIC: <Noun> <Verb> <Object>

PHRASE NETS
LOOK FOR SPECIFIC LINKING PATTERNS IN THE TEXT: A AND B, A AT B, A OF B, ETC.
  COULD BE OUTPUT OF REGEXP OR PARSER
VISUALIZE EXTRACTED PATTERNS IN A NODE-LINK VIEW
  OCCURRENCES = NODE SIZE
  PATTERN POSITION = EDGE DIRECTION
van Ham et al
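The regular-expression route to a phrase net can be sketched as follows. The function names are hypothetical, and single-word matching with `\w+` is a simplification of what a parser-based extraction would give.

```python
import re
from collections import Counter

def phrase_net_edges(text, link="and"):
    """Count directed edges X -> Y for every 'X <link> Y' pattern in the text."""
    # The lookahead captures Y without consuming it, so "a and b and c"
    # yields both (a, b) and (b, c).
    pattern = re.compile(r"\b(\w+)\s+" + re.escape(link) + r"\s+(?=(\w+))",
                         re.IGNORECASE)
    edges = Counter()
    for a, b in pattern.findall(text):
        edges[(a.lower(), b.lower())] += 1
    return edges

def node_sizes(edges):
    """Node size = number of pattern occurrences the word takes part in."""
    sizes = Counter()
    for (a, b), count in edges.items():
        sizes[a] += count
        sizes[b] += count
    return sizes

sample = ("Cats and dogs fought. Dogs and cats made peace. "
          "Cats and dogs and birds sang.")
edges = phrase_net_edges(sample)
sizes = node_sizes(edges)
```

Edge direction follows position in the pattern, so "cats and dogs" and "dogs and cats" are kept as separate edges.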

X and Y PORTRAIT OF THE ARTIST AS A YOUNG MAN JAMES JOYCE

NODE GROUPING

THE BIBLE {X} begat {Y}

18TH & 19TH CENTURY NOVELS {X}'s {Y}

OLD TESTAMENT {X} of {Y}

X of Y NEW TESTAMENT {X} of {Y}

VISUALIZING DOCUMENT COLLECTIONS

newsmap.jp

DOCUMENT CARDS SMALL MULTIPLES FOR DOCUMENTS

THEMERIVER HAVRE ET AL 1999

PARALLEL TAG CLOUDS COLLINS ET AL. 09

SUPPORTING SEARCH TileBars Hearst 1999

SeeSoft Eick 1994

JIGSAW

CENDARI NOTE-TAKING ENVIRONMENT 2015

DOCUMENT SIMILARITY & CLUSTERING
COMPUTE SIMILARITY BETWEEN DOCUMENTS BASED ON THE WORDS THEY SHARE
  TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY) IS COMMON
TOPIC MODELING APPROACHES
  ASSUME DOCUMENTS ARE A MIXTURE OF TOPICS
  TOPICS ARE (ROUGHLY) A SET OF CO-OCCURRING TERMS
  LATENT SEMANTIC ANALYSIS (LSA): REDUCE TERM MATRIX
MANY, MANY APPROACHES EXIST
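Document similarity based on shared words can be sketched with TF-IDF weights and cosine similarity. The weighting used here (relative term frequency times the natural log of N over document frequency) is one common variant chosen for illustration; the slides do not specify a formula, and whitespace tokenization is a simplification.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["cats eat mice", "cats eat kibble", "dogs chase cars"]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])
sim_02 = cosine(vecs[0], vecs[2])
```

The first two documents share "cats" and "eat" and so score above zero, while the third shares no terms with the first and scores zero. A pairwise similarity matrix like this is a typical input to clustering or layout.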

STANFORD DISSERTATION BROWSER CHUANG, RAMAGE, MANNING & HEER 2012


WARNING
OFTEN, TEXT VISUALIZATIONS DO NOT REPRESENT TEXT DIRECTLY, BUT THEY REPRESENT A MODEL: WORD COUNTS, WORD SEQUENCES, CLUSTERS, ETC.
ASK:
  CAN YOU INTERPRET THE VISUALIZATION?
  DOES THE MODEL ACCURATELY REPRESENT THE ORIGINAL TEXT?

LESSONS FOR TEXT VISUALIZATION
SHOW SOURCE TEXT (OR PROVIDE ACCESS TO IT)
  WHERE POSSIBLE, USE VISUALIZATION AS INDEX INTO DOCUMENTS
GROUP DOCUMENTS IN MEANINGFUL WAYS
  WILL VIEWERS UNDERSTAND THE CLUSTERS?
WHERE POSSIBLE, USE TEXT TO REPRESENT TEXT

HUNDREDS OF TOOLS & TECHNIQUES FOR TEXT AT http://textvis.lnu.se/

QUESTIONS?

EXAM
2h, Dec 8th
bring a pencil
questions from lectures (at least 1 per lecture)
some creativity questions
some questions about assessing visualizations
every student gets individual exam sheet

EXAM
best way to mark a box:
unacceptable way to mark a box:
if you make an error, erase your answer
if you forgot your eraser, mark the box like this

ACKNOWLEDGEMENTS
Slides were inspired by, adapted from, or taken from slides by
Christopher Collins (University of Ontario Institute of Technology)
Wesley Willett (University of Calgary)