Image and Text CSE 408 Multimedia Info Sys. Yezhou Yang Lots of slides from Tamara Berg
People, Pictures, and Language Can you hand me the remote? Tags: canon, eos, macro, japan, vacation, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite. It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes... Scarlett O'Hara, Gone with the Wind. It's the perfect party dress. With distinctly feminine details such as a wide sash bow around an empire waist and a deep scoopneck, this linen dress will keep you comfortable and feeling elegant all evening long. * Measures 38" from center back, hits at the knee. * Scoopneck, full skirt. * Hidden side zip. * 100% Linen. Dry clean. People describe the world around them all the time.
Descriptive Text It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin, that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns. Scarlett O'Hara described in Gone with the Wind. How does the world work? Visually descriptive language provides: information about the world, especially the visual world; information about how people construct natural language for imagery; guidance for visual recognition. What should we recognize? How do people describe the world?
More Nuance than Traditional Recognition person car shoe
Human centric recognition outputs car
Human centric recognition outputs pink car
Human centric recognition outputs car on road
Human centric recognition outputs Little pink smart car parked on the side of a road in a London shopping district. Telling the story of an image
Generating Image Descriptions
Existing Approaches A random Pink Smart Car seen driving around Lambeth Roundabout and onto Lambeth Bridge. Smart Car. It was so adorable and cute in the parking lot of the post office, I had to stop and take a picture. Pink Car Sign Door Motorcycle Tree Brick building Dirty Road Sidewalk London Shopping district Natural language description Generation Methods: 1) Compose descriptions from recognized content 2) Compose descriptions from recognized content & existing captions
Related Work 1) Compose descriptions from recognized content [Yao et al, IEEE 1998], [Kulkarni et al, CVPR 2011], [Li et al, CoNLL 2011], [Yang et al, EMNLP 2011], [Mitchell et al, EACL 2012], [Guadarrama et al, ICCV 2013], [Krishnamoorthy et al, NAACL 2013], [Thomason et al, COLING 2014] 2) Compose descriptions from recognized content & existing captions [Farhadi et al, ECCV 2010], [Feng & Lapata, ACL 2010], [Aker and Gaizauskas, ACL 2010], [Ordonez et al, NIPS 2011], [Kuznetsova et al, ACL 2012], [Kuznetsova et al, TACL 2014]
Example 1: Compose descriptions from recognized content Baby Talk: Understanding and Generating Simple Image Descriptions Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, Tamara L Berg CVPR 2011
This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.
Methodology Vision -- detection and classification predictions Text inputs - statistics from parsing lots of descriptive text Graphical model (CRF) to predict best image labeling given vision and text inputs Generation algorithms to generate natural language
Vision is hard! Green sheep World knowledge (from descriptive text) can be used to smooth noisy vision predictions!
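The "green sheep" point can be illustrated with a toy calculation. This is a minimal sketch, not the CRF used in the Baby Talk system: it simply multiplies a detector score by a text-derived prior and renormalizes. All numbers and the function name `smooth` are hypothetical.

```python
# Hypothetical attribute scores from a vision classifier for a detected sheep.
vision_scores = {"green": 0.40, "white": 0.35, "furry": 0.25}

# Hypothetical prior P(attribute | "sheep") mined from descriptive text.
text_prior = {"green": 0.02, "white": 0.58, "furry": 0.40}

def smooth(vision, prior):
    """Multiply each vision score by its text prior, then renormalize."""
    combined = {a: vision[a] * prior.get(a, 0.0) for a in vision}
    total = sum(combined.values())
    return {a: s / total for a, s in combined.items()}

smoothed = smooth(vision_scores, text_prior)
best_vision = max(vision_scores, key=vision_scores.get)
best_smoothed = max(smoothed, key=smoothed.get)
print(best_vision, "->", best_smoothed)
```

With these made-up numbers the vision-only choice is "green", while the smoothed choice is "white", because text rarely describes sheep as green.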
Methodology Vision -- detection and classification predictions Text -- statistics from parsing lots of descriptive text Graphical model (CRF) to predict best image labeling given vision and text inputs Generation algorithms to generate natural language
Learning from Descriptive Text Attributes: "green green grass by the lake"; "a very shiny car in the car museum in my hometown of upstate NY." Relationships: "very little person in a big rocking chair"; "Our cat Tusik sleeping on the sofa near a hot
Methodology Vision -- detection and classification predictions Text -- statistics from parsing lots of descriptive text Model (CRF) to predict best image labeling given vision and text inputs Generation algorithms to compose natural language
System Flow Input image -> extract objects/stuff (a: dog, b: person, c: sofa) -> predict attributes (scores for brown, striped, furry, wooden, feathered per object) -> predict prepositions (scores for near, against, beside per object pair) -> predict labeling (vision potentials smoothed with text potentials) -> generate natural language description. Predicted labeling: <<null,person_b>,against,<brown,sofa_c>>, <<null,dog_a>,near,<null,person_b>>, <<null,dog_a>,beside,<brown,sofa_c>>. Generated description: This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.
Some good results This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road. Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky. This is a picture of two dogs. The first dog is near the second furry
Some bad results Missed detections: Here we see one potted plant. This is a picture of one dog. False detections: There are one road and one cat. The furry road is in the furry cat. This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the red road. Incorrect attributes: This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass. This is a photograph of two horses and one grass. The first feathered horse is within the green grass, and by the second feathered horse. The second feathered horse is within
Thoughts? Novelty? Pros/Cons? What applications might this be useful for? What would you change/do next?
Us vs Humans This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. H1: A Lemonaide stand is manned by a blonde child with a cookie. H2: A small child at a lemonade and cookie stand on a city corner. H3: Young child behind lemonade stand eating a cookie. Sounds unnatural UIUC pascal sentence dataset Rashtchian, Young, Hodosh and Hockenmaier NAACL HLT 2010
Example 2: Compose Descriptions from Recognized Content + Existing Descriptions
Composing captions guessing game a) monkey playing in the tree canopy, Monte Verde in the rain forest b) capuchin monkey in front of my window c) monkey spotted in Apenheul Netherlands under the tree d) a white-faced or capuchin in the tree in the garden e) the monkey sitting in a tree, posing for his picture
Through the smoke Duna Portrait #5 Mirror and gold the cat lounging in the sink Data exists, but buried in
Captions in the Wild http://tamaraberg.com/sbucaptions The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon Man sits in a rusted car buried in the sand on Waitarere beach Little girl and her dog in northern Thailand. They both seemed interested in what we were doing Our dog Zoe in her bed Interior design of modern white and brown living room furniture against white wall with a lamp hanging. Emma in her hat looking super cute Ordonez et al, NIPS 2011
Ordonez et al, NIPS 2011 Harness the Web Captioned Photo Dataset Global Matching (GIST + Color) 1 million captioned images! The bridge over the lake on Suzhou Street. Bridge to temple in Hoan Kiem lake. A walk around the lake near our house with Abby. Transfer Whole Caption(s) e.g. The bridge over the lake on Suzhou Street. Smallest house in paris between red (on right) and beige (on left). Hangzhou bridge in West lake. The daintree river by boat.
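Whole-caption transfer by global matching amounts to a nearest-neighbor lookup. The sketch below assumes each image is already reduced to a fixed-length descriptor (a stand-in for GIST + color); the three-dimensional vectors, the query, and the function names are hypothetical, and the captions are the examples from the slide.

```python
import math

# (descriptor, caption) pairs; descriptors are made-up stand-ins for GIST + color.
database = [
    ([0.9, 0.1, 0.3], "The bridge over the lake on Suzhou Street."),
    ([0.2, 0.8, 0.5], "Smallest house in paris between red (on right) and beige (on left)."),
    ([0.8, 0.2, 0.4], "Hangzhou bridge in West lake."),
    ([0.1, 0.3, 0.9], "The daintree river by boat."),
]

def euclidean(u, v):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def transfer_captions(query, db, k=1):
    """Return the captions of the k database images closest to the query."""
    ranked = sorted(db, key=lambda item: euclidean(query, item[0]))
    return [caption for _, caption in ranked[:k]]

query_descriptor = [0.85, 0.15, 0.3]  # hypothetical descriptor of the query image
top = transfer_captions(query_descriptor, database, k=2)
print(top)
```

In a collection of a million images the linear scan would normally be replaced by an approximate nearest-neighbor index; the ranking idea stays the same.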
Kuznetsova et al, ACL 2012 Transfer pieces of Captions Object appearance NP: the dirty sheep Object pose VP: meandered along a desolate road Scene appearance PP: in the highlands of Scotland Region appearance & relationship PP: through frozen grass Example Composed Description: the dirty sheep meandered along a desolate road in the highlands of Scotland through frozen grass
Image Description Generation Computer Vision (Objects, Actions, Stuff, Scenes) -> Phrase Retrieval -> Generation -> Description
Retrieving VPs Detect: dog Find matching detections by pose similarity Contented dog just laying down in front of a house.. Peruvian dog sleeping on city street in the city of Cusco, (Peru) this dog was laying in the middle of the road on a back street in jaco Closeup of my dog sleeping under my desk.
Retrieving NPs Tray of glace fruit in the market at Nice, France Fresh fruit in the market Detect: fruit Find matching detections by appearance similarity A box of oranges was just catching the sun, bringing out detail in the skin. mandarin oranges in glass The street market in Santanyi, Mallorca is a must for the oranges and local crafts. An orange tree in the backyard of
Retrieving PPs (stuff) Detect: stuff Find matching regions by appearance + arrangement similarity Cordoba - lonely elephant under an orange tree... Comfy chair under a I positioned the chairs around the lemon tree -- it's like a shrine Mini Nike soccer ball all alone in the grass
Retrieving PPs (scene) Extract scene descriptor Find matching images by global scene similarity I'm about to blow the building across the street over with my massive lung power. Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere View from our B&B in this photo Only in Paris will you find a bottle of wine on a table outside a
Image Description Generation Computer Vision (Objects, Actions, Stuff, Scenes) -> Phrase Retrieval -> Generation -> Description
Sentence
Object NPs: birds, the bird. Action VPs: are standing, looking for food. Stuff PPs: in water, over water. Scene PPs: in the ocean, near Salt Pond. -> Sentence
Object NPs: birds, the bird. Action VPs: are standing, looking for food. Stuff PPs: in water, over water. Scene PPs: in the ocean, near Salt Pond. Position 1: birds; Position 2: over water; Position 3: are standing; Position 4: in the ocean
Possible Assignments Each of Position 1 to Position 4 can be filled by any candidate phrase: birds | the bird | are standing | in the ocean
Dynamic Programming Positions 1 to 4, each with candidate phrases: birds | the bird | are standing | in the ocean
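A Viterbi-style recurrence is one way to read the dynamic programming slide: keep, for each phrase, the best score of any sequence ending in that phrase, then extend one position at a time. This is a minimal sketch under that assumption; the cohesion scores and the penalty for unseen pairs are made up.

```python
phrases = ["birds", "the bird", "are standing", "in the ocean"]

# Hypothetical cohesion scores between adjacent phrases (higher is better).
cohesion = {
    ("birds", "are standing"): 2.0,
    ("the bird", "are standing"): 0.5,
    ("are standing", "in the ocean"): 2.0,
    ("birds", "in the ocean"): 1.0,
}

def pair_score(a, b):
    """Score of placing phrase b right after phrase a; unseen pairs are penalized."""
    return cohesion.get((a, b), -1.0)

def best_sequence(phrases, length):
    """best[p] holds the best score of any sequence that ends in phrase p."""
    best = {p: 0.0 for p in phrases}
    back = []
    for _ in range(length - 1):
        new_best, pointers = {}, {}
        for cur in phrases:
            prev = max(phrases, key=lambda q: best[q] + pair_score(q, cur))
            new_best[cur] = best[prev] + pair_score(prev, cur)
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    last = max(phrases, key=best.get)
    sequence = [last]
    for pointers in reversed(back):  # follow back-pointers to recover the path
        sequence.append(pointers[sequence[-1]])
    return list(reversed(sequence)), best[last]

sequence, score = best_sequence(phrases, 3)
print(sequence, score)
```

The recurrence only sees adjacent pairs, so it cannot by itself forbid reusing a phrase type or enforce agreement between distant phrases; the constraint slides that follow address exactly those cases.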
Phrases of the Same Type Positions 1 to 4, each with candidate phrases: birds | the bird | are standing | in the ocean
Singular/Plural Relationships Positions 1 to 4, each with candidate phrases: birds | the bird | are standing | in the ocean
Integer Linear Programming (ILP) Integer Variables Constraints x1 x2 xn Linear Function Max/Min? Variables Values?
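For reference, the standard textbook form of an integer linear program is written out below. The symbols c, A, b are generic and not taken from the slides.

```latex
\begin{aligned}
\max_{x}\quad & c^{\top} x \\
\text{subject to}\quad & A x \le b, \\
& x \in \mathbb{Z}^{n}
\end{aligned}
```

A minimization problem is the same with the sign of c flipped. When every variable is a yes/no selection, the domain becomes x in {0,1}^n.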
ILP for Surface Realization Phrase j at Position k, Phrase q at Position (k+1); candidate phrases at each position: birds | the bird | are standing | in the ocean
Binary Variables Phrase j at Position k, Phrase q at Position (k+1). Example: herons = 1 at Position k, herons = 0 at Position (k+1); the bird = 0; are fishing = 1. Candidate phrases at both positions: herons | the bird | are fishing | in the ocean
Optimization Function Phrase Selection: Phrase j at Position k. Phrase Compatibility = N-gram cohesion + Head word co-occurrence (Google Web 1-T Dataset)
Constraints Ensure that contiguous positions for phrases are selected Always select a Noun Phrase Select at most one phrase of each type
ILP Optimization Optimize for: vision scores (visual detection/classification scores) and phrase cohesion (n-gram statistics between phrases; co-occurrence statistics between phrase head words). Subject to: linguistic constraints (allow at most one phrase of each type; enforce plural/singular agreement between NP and VP) and discourse constraints (prevent inclusion of repeated phrasing).
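The objective and constraints can be made concrete on a tiny instance. The sketch below is not an ILP solver: it enumerates every 0/1 selection by brute force, which is only feasible for a handful of phrases. All scores are hypothetical, phrase order is fixed as NP, VP, stuff PP, scene PP, and the contiguous-position constraint is not modeled.

```python
from itertools import product

# Candidates per type: (text, hypothetical vision score, grammatical number or None).
candidates = {
    "NP": [("birds", 0.8, "pl"), ("the bird", 0.7, "sg")],
    "VP": [("are standing", 0.6, "pl"), ("looking for food", 0.3, None)],
    "PP_stuff": [("in water", 0.4, None), ("over water", 0.2, None)],
    "PP_scene": [("in the ocean", 0.5, None), ("near Salt Pond", 0.1, None)],
}

# Hypothetical cohesion between adjacent selected phrases (may be negative).
cohesion = {
    ("birds", "are standing"): 0.5,
    ("are standing", "in water"): 0.3,
    ("in water", "in the ocean"): -0.6,
}

def feasible(choice):
    np_, vp = choice["NP"], choice["VP"]
    if np_ is None:                       # always select a noun phrase
        return False
    if vp and vp[2] and vp[2] != np_[2]:  # plural/singular agreement
        return False
    return True

def objective(choice):
    picked = [choice[t] for t in candidates if choice[t] is not None]
    score = sum(p[1] for p in picked)
    score += sum(cohesion.get((a[0], b[0]), 0.0) for a, b in zip(picked, picked[1:]))
    return score

def solve():
    types = list(candidates)
    # None means "no phrase of this type", so at most one phrase per type is chosen.
    options = [[None] + candidates[t] for t in types]
    best, best_score = None, float("-inf")
    for combo in product(*options):
        choice = dict(zip(types, combo))
        if feasible(choice) and objective(choice) > best_score:
            best, best_score = choice, objective(choice)
    return [best[t][0] for t in types if best[t] is not None], best_score

selected, total = solve()
print(" ".join(selected), round(total, 2))
```

A real system would hand the same variables and constraints to an ILP solver; enumeration is used here only so the example runs without extra dependencies.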
Thoughts? Novelty? Pros/Cons? What applications might this be useful for? What would you change/do next?
Pros/Cons 1) Detect & Generate from Scratch. Example: This is a picture of two dogs. The first dog is near the second furry dog. Pros: Produces visually relevant descriptions. May be useful for applications for the visually impaired. Cons: May bore the user. 2) Retrieval-Based Generation. Example: the dirty sheep meandered along a desolate road in the highlands of Scotland through frozen grass. Pros: Generates more natural/creative descriptions. Can transfer language without modeling everything. Cons: Existing web text may contain (non-visual) contextual
Generating Natural-Language Video Descriptions Using Text-Mined Knowledge Krishnamoorthy, Malkarnenkar, Mooney, Saenko, Guadarrama
Video Description Dataset (Chen & Dolan, ACL 2011) 2,089 YouTube videos with 122K multi-lingual descriptions. Originally collected for paraphrase and machine translation examples. Available at: http://www.cs.utexas.edu/users/ml/clamp/videodescription/
Sample M-Turk Human Descriptions (average ~50 per video) A MAN PLAYING WITH TWO DOGS A man takes a walk in a field with his dogs. A man training the dogs in a big field. A person is walking his dogs. A woman is walking her dogs. A woman is walking with dogs in a field. A woman is walking with four dogs outside. A woman walks across a field with several dogs. All dogs are going along with the woman. dogs are playing Dogs follow a man. Several dogs follow a person. some dog playing each other Someone walking in a field with dogs. very cute dogs A MAN IS GOING WITH A DOG. The woman is walking her dogs. A person is walking some dogs. A man walks with his dogs in the field. A man is walking dogs. a dogs are running A guy is training his dogs A man is walking with dogs. a men and some dog are running A men walking with dogs. A person is walking with dogs. A woman is walking her dogs. Somebody walking with his/her pets. the man is playing with the dogs. A guy training his dogs. A lady is roaming in the field with his dogs. A lady playing with her dogs. A man and 4 dogs are walking through a field.
Video Description Task Generate a short, declarative sentence describing a video in this corpus. First generate a subject (S), verb (V), object (O) triplet for describing the video. Content planning o <cat, play, ball> Next generate a grammatical sentence from this triplet. Surface realization o A cat is playing with a ball.
SUBJECT: person VERB: ride OBJECT: motorbike -> A person is riding a motorbike.
OBJECT DETECTIONS motorbike 0.51, person 0.42, car 0.29, aeroplane 0.05; other classes shown: table, dog, cow, train (with scores among 0.07, 0.15, 0.11, 0.17)
SORTED OBJECT DETECTIONS motorbike 0.51 person 0.42 car 0.29 aeroplane 0.05
VERB DETECTIONS move 0.34, hold 0.23, ride 0.19, dance 0.05; other verbs shown: slice, climb, drink, shoot (with scores among 0.13, 0.17, 0.07, 0.11)
SORTED VERB DETECTIONS move 0.34 hold 0.23 ride 0.19 dance 0.05
SORTED OBJECT DETECTIONS motorbike 0.51 person 0.42 car 0.29 aeroplane 0.05 SORTED VERB DETECTIONS move 0.34 hold 0.23 ride 0.19 dance 0.05
OBJECTS VERBS EXPAND VERBS move 1.0 walk 0.8 pass 0.8 ride 0.8
OBJECTS VERBS EXPAND VERBS hold 1.0 keep 1.0
OBJECTS VERBS EXPAND VERBS ride 1.0 go 0.8 move 0.8 walk 0.7
OBJECTS VERBS EXPAND VERBS dance 1.0 turn 0.7 jump 0.7 hop 0.6
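Verb expansion can be sketched as spreading each detected verb's score to its similar verbs. The similarity weights below are the ones on the expansion slides and the detection scores are the sorted verb detections; the combination rule (detection score times similarity, keeping the maximum when a verb is reached from several sources) is an assumption, not something the slides state.

```python
# Sorted verb detections and per-verb expansion weights, as read from the slides.
verb_scores = {"move": 0.34, "hold": 0.23, "ride": 0.19, "dance": 0.05}
expansions = {
    "move": {"move": 1.0, "walk": 0.8, "pass": 0.8, "ride": 0.8},
    "hold": {"hold": 1.0, "keep": 1.0},
    "ride": {"ride": 1.0, "go": 0.8, "move": 0.8, "walk": 0.7},
    "dance": {"dance": 1.0, "turn": 0.7, "jump": 0.7, "hop": 0.6},
}

def expand(verb_scores, expansions):
    """Assumed rule: expanded score = detection score * similarity, max over sources."""
    expanded = {}
    for verb, score in verb_scores.items():
        for similar, similarity in expansions[verb].items():
            expanded[similar] = max(expanded.get(similar, 0.0), score * similarity)
    return expanded

expanded = expand(verb_scores, expansions)
for verb, score in sorted(expanded.items(), key=lambda kv: -kv[1]):
    print(f"{verb:6s} {score:.3f}")
```

Under this rule "ride" inherits a higher score from "move" (0.34 x 0.8) than from its own detection (0.19), which shows how expansion can promote a verb the detector ranked lower.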
Web-scale text corpora: BNC, GigaWord, ukWaC, WaCkypedia, Google Ngrams OBJECTS VERBS EXPANDED VERBS GET DEPENDENCY PARSES A man rides a horse det(man-2, A-1) nsubj(rides-3, man-2) root(root-0, rides-3) det(horse-5, a-4) dobj(rides-3, horse-5) <person, ride, horse> Subject-Verb-Object triplet
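Reading an SVO triplet off a dependency parse comes down to pairing the nsubj and dobj relations that share a verb. A minimal sketch over the relation strings from the slide; the regular expression and function name are illustrative. Lemmatizing "rides" to "ride" and mapping "man" to the detector class "person" are further steps not shown here.

```python
import re

# Dependency relations for "A man rides a horse", as listed on the slide.
parse = [
    "det(man-2, A-1)",
    "nsubj(rides-3, man-2)",
    "root(root-0, rides-3)",
    "det(horse-5, a-4)",
    "dobj(rides-3, horse-5)",
]

# relation(head-index, dependent-index)
RELATION = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def extract_svo(relations):
    """Pair each verb's nsubj dependent with the same verb's dobj dependent."""
    subjects, objects = {}, {}
    for rel in relations:
        match = RELATION.fullmatch(rel)
        if not match:
            continue
        name, head, head_idx, dependent, _ = match.groups()
        key = (head, head_idx)  # the index disambiguates repeated words
        if name == "nsubj":
            subjects[key] = dependent
        elif name == "dobj":
            objects[key] = dependent
    return [(subjects[k], k[0], objects[k]) for k in subjects if k in objects]

triplets = extract_svo(parse)
print(triplets)
```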
Web-scale text corpora: BNC, GigaWord, ukWaC, WaCkypedia, Google Ngrams OBJECTS VERBS EXPANDED VERBS <person, ride, horse> <person, walk, dog> <person, hit, ball>... SVO Language Model
Web-scale text corpora: BNC, GigaWord, ukWaC, WaCkypedia, Google Ngrams OBJECTS VERBS EXPANDED VERBS <person, ride, horse> <person, walk, dog> <person, hit, ball>... SVO Language Model Regular Language Model
Web-scale text corpora: BNC, GigaWord, ukWaC, WaCkypedia, Google Ngrams OBJECTS VERBS EXPANDED VERBS SVO LANGUAGE MODEL CONTENT PLANNING: <person, ride, motorbike> REGULAR LANGUAGE MODEL
Web-scale text corpora: BNC, GigaWord, ukWaC, WaCkypedia, Google Ngrams OBJECTS VERBS EXPANDED VERBS SVO LANGUAGE MODEL CONTENT PLANNING: <person, ride, motorbike> SURFACE REALIZATION: A person is riding a motorbike. REGULAR LANGUAGE MODEL
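The two stages can be sketched end to end. Content planning here picks the triplet that maximizes the product of the vision scores and an SVO language-model probability; surface realization uses a fixed template in place of the regular language model. The language-model probabilities, the unseen-triplet floor, the product rule, and the template are all assumptions made for illustration.

```python
from itertools import permutations

# Vision scores as read from the sorted detection slides (top three of each).
object_scores = {"motorbike": 0.51, "person": 0.42, "car": 0.29}
verb_scores = {"move": 0.34, "hold": 0.23, "ride": 0.19}

# Hypothetical SVO language-model probabilities mined from text.
svo_lm = {("person", "ride", "motorbike"): 0.05, ("person", "move", "car"): 0.004}
UNSEEN = 1e-4  # hypothetical floor for triplets never seen in text

def plan_content(object_scores, verb_scores, svo_lm):
    """Pick the <S, V, O> triplet maximizing vision scores times LM probability."""
    def score(triplet):
        s, v, o = triplet
        vision = object_scores[s] * verb_scores[v] * object_scores[o]
        return vision * svo_lm.get(triplet, UNSEEN)
    candidates = [(s, v, o)
                  for s, o in permutations(object_scores, 2)
                  for v in verb_scores]
    return max(candidates, key=score)

def realize(triplet):
    """Template-based surface realization (a stand-in for a language model)."""
    s, v, o = triplet
    gerund = (v[:-1] if v.endswith("e") else v) + "ing"
    return f"A {s} is {gerund} a {o}."

best = plan_content(object_scores, verb_scores, svo_lm)
sentence = realize(best)
print(best, sentence)
```

Even though "motorbike" and "move" have the highest individual vision scores, the language-model term favors <person, ride, motorbike>, matching the slide's example.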
Selecting SVO Just Using Vision (Baseline) Top object detection from vision = Subject Next highest object detection = Object Top activity detection = Verb
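The baseline rule is short enough to write down directly. The scores are the sorted detections from the earlier slides (partial lists); the function name is illustrative.

```python
# Sorted detections as read from the slides (partial lists).
object_scores = {"motorbike": 0.51, "person": 0.42, "car": 0.29, "aeroplane": 0.05}
verb_scores = {"move": 0.34, "hold": 0.23, "ride": 0.19, "dance": 0.05}

def vision_baseline(object_scores, verb_scores):
    """Top object = subject, next highest object = object, top activity = verb."""
    ranked_objects = sorted(object_scores, key=object_scores.get, reverse=True)
    top_verb = max(verb_scores, key=verb_scores.get)
    return ranked_objects[0], top_verb, ranked_objects[1]

triplet = vision_baseline(object_scores, verb_scores)
print(triplet)  # ('motorbike', 'move', 'person')
```

On this example the baseline yields <motorbike, move, person>, which illustrates why detector ranks alone are a weak guide to who is doing what to whom.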
Test Data Selected 185 test videos that contain one of the 20 detectable (Pascal) objects and 58 detectable activities based on their words (or synonyms) appearing in their human descriptions.
Good Examples
Bad Examples
SVO Accuracy Results (w1 = 0)
Binary Accuracy               Subject   Activity   Object   All
Vision baseline               71.35%    8.65%      29.19%   1.62%
SVO LM (No Verb Expansion)    85.95%    16.22%     24.32%   11.35%
SVO LM (Verb Expansion)       85.95%    36.76%     33.51%   23.78%
WUP Accuracy                  Subject   Activity   Object   All
Vision baseline               87.76%    40.20%     61.18%   63.05%
SVO LM (No Verb Expansion)    94.90%    63.54%     69.39%   75.94%
SVO LM (Verb Expansion)       94.90%    66.36%     72.74%   78.00%
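WUP refers to Wu-Palmer similarity, which gives partial credit when a predicted word is close to the reference in a taxonomy: 2 x depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest common ancestor. A minimal sketch on a hypothetical toy taxonomy, counting the root as depth 1; the actual evaluation would use a lexical database such as WordNet.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
parent = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "motorbike": "vehicle",
}

def path_to_root(node, parent):
    """List of nodes from the given node up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def wup(a, b, parent):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b, parent))
    # First shared node on a's path to the root is the deepest common ancestor.
    lcs = next(n for n in path_to_root(a, parent) if n in ancestors_b)
    depth = lambda n: len(path_to_root(n, parent))
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup("dog", "dog", parent))
print(wup("dog", "cat", parent))
print(wup("dog", "car", parent))
```

This explains why the WUP numbers in the table are higher than the binary ones: a near miss such as "cat" for "dog" earns partial credit instead of zero.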
Automatic Evaluation of Sentence Quality Evaluate generated sentences using standard Machine Translation (MT) metrics. Treat all human provided descriptions as reference translations
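The slide does not name the metrics; BLEU is one standard MT metric that works with multiple references, and its core ingredient is clipped n-gram precision. The sketch below shows only that ingredient (no brevity penalty, no averaging over n-gram orders), with whitespace tokenization as a simplifying assumption. The references are two of the human descriptions listed earlier.

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate against multiple references."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.lower().split())
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split()).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Each candidate n-gram counts at most as often as it appears in any one reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

references = ["A man is walking dogs.", "A person is walking his dogs."]
candidate = "A man is walking with dogs."
p1 = modified_precision(candidate, references, n=1)
p2 = modified_precision(candidate, references, n=2)
print(p1, p2)
```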
Human Evaluation of Descriptions Asked 9 unique MTurk workers to evaluate descriptions of each test video. Asked to choose between vision-baseline sentence, SVO-LM (VE) sentence, or neither. When preference expressed, 61.04% preferred SVO-LM (VE) sentence. For 84 videos where the majority of judges had a clear preference, 65.48% preferred the SVO-LM (VE) sentence.
Thoughts? Novelty? Pros/Cons? What applications might this be useful for? What would you change/do next?
Toward understanding how people describe images 1. What should we describe? 2. What should we call image content? 3. How should we refer to specific objects/content within an image?
1) What should we describe? What's in this image? man, baby, sling, ladder, fridge, table, watermelon, chair, wall, pacifier, beard, glasses, shirt. What do people describe? A bearded man is holding a child in a sling. A bearded man stands while holding a small child in a green sheet. A bearded man with a baby in a sling poses. Man standing in kitchen with little girl in green sack. Man with beard and baby. Berg et al, CVPR 2012
Content Importance Varies Semantics matters! Berg et al, CVPR 2012
Importance Factors What factors influence what someone will find important (describe) about an image? Kinds of factors: Compositional Semantic Contextual
Compositional factors Size/Saliency Location A sail boat on the ocean.
Semantic factors Object Type Scene Type & Depiction Strength girl in the street
Contextual factors Object-Scene Unusualness Attribute-Object Unusualness A tree in water and a boy with a beard
Toward understanding how people describe images 1. What should we describe? 2. What should we call image content?
2) What should we call content? Object > Organism > Animal > Chordate > Vertebrate > Bird > Aquatic bird > Swan > Whistling swan > Cygnus columbianus Ordonez et al, ICCV 2013
Entry-Level Categories The category that people are likely to name when presented with a depiction of an object. Rosch et al, 1976 Jolicoeur, Gluck & Kosslyn, 1984 Superordinates: animal, vertebrate Entry Level: bird Subordinates: Black-capped chickadee Ordonez et al, ICCV 2013
Entry-Level Categories The category that people are likely to name when presented with a depiction of an object. Rosch et al, 1976 Jolicoeur, Gluck & Kosslyn, 1984 Superordinates: animal, bird Entry Level: penguin Subordinates: Chinstrap penguin Ordonez et al, ICCV 2013
Naming Image Content Input Image -> Vision: Thousands of Noisy Category Predictions (e.g. Grampus griseus) -> Naming: Pick the Best -> Dolphin. What Should I Call It? Example predicted categories, each with a confidence score: Grampus griseus, American black bear, Grizzly bear, King penguin, Cormorant, Homing pigeon, Ball-peen hammer, Spigot, Diskette (floppy), Steel arch bridge, Farmhouse, Soapweed, Brazilian rosewood, Bristlecone pine, Cliff diving, Crabapple. Ordonez et al, ICCV 2013
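One way to picture "pick the best" is to pool fine-grained prediction scores up a hierarchy and weight each candidate name by how naturally people use it. This is a simplified sketch, not the method of Ordonez et al; the leaf scores, ancestor lists, and naturalness weights are all hypothetical.

```python
# Hypothetical fine-grained classifier scores for one image.
leaf_scores = {"Grampus griseus": 0.60, "bottlenose dolphin": 0.25, "grizzly bear": 0.15}

# Hypothetical ancestors of each leaf, from specific to general.
ancestors = {
    "Grampus griseus": ["dolphin", "whale", "animal"],
    "bottlenose dolphin": ["dolphin", "whale", "animal"],
    "grizzly bear": ["bear", "animal"],
}

# Hypothetical "naturalness": how readily people use each word as a name.
naturalness = {
    "Grampus griseus": 0.01, "bottlenose dolphin": 0.10, "grizzly bear": 0.20,
    "dolphin": 0.70, "whale": 0.60, "bear": 0.80, "animal": 0.30,
}

def entry_level_name(leaf_scores, ancestors, naturalness):
    """Score each name by the pooled score of its leaves times its naturalness."""
    pooled = {}
    for leaf, score in leaf_scores.items():
        for name in [leaf] + ancestors[leaf]:
            pooled[name] = pooled.get(name, 0.0) + score
    scored = {name: pooled[name] * naturalness[name] for name in pooled}
    return max(scored, key=scored.get), scored

best_name, scored = entry_level_name(leaf_scores, ancestors, naturalness)
print(best_name)
```

Very general names collect the most pooled score but are penalized for being uninformative, and very specific names are penalized for being unnatural, so a mid-level name such as "dolphin" wins.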
3) How should we refer to specific content? Orange Chair in the second row Kazemzadeh et al, EMNLP 2014
Human Computer Interaction It is the big blue book on the top shelf Where is Harry Potter?
3) How should we refer to specific content? Bottle
3) How should we refer to specific content? Orange Bottle
3) How should we refer to specific content? Orange Bottle on the Right
REG Datasets GRE3D3 (Viethen & Dale 2008): 20 scenes; TUNA (van Deemter et al 2006); Size Corpus (Mitchell et al 2011): 96 images; GenX Corpus (FitzGerald et al 2013): 269 scenes; Typicality Corpus (Mitchell et al 2013): 35 scenes
Natural Scenes Diverse: many real world objects. Complex: many object instances. Big: 20k images. IAPR TC-12 Segmented and Annotated Dataset. Escalante et al. 2009
http://referitgame.com ReferItGame Collecting referring expressions for objects in real world photos Player 1 Orange bottle on the right Orange bottle on the right Player 2 Kazemzadeh et al, EMNLP 2014
ReferItGame Dataset picture on the wall picture picture Collected: 130,525 expressions, referring to 96,654 objects, in 19,894 photographs big gated window on right of white section black big window right brown railings on right red guy left sitting leftmost bottom guy red shirt on left Kazemzadeh et al, EMNLP 2014