use std::io::Read;

use html5ever::rcdom::NodeData::{
    Document,
    Doctype,
    Text,
    Comment,
    Element,
    ProcessingInstruction
};
use html5ever::rcdom::{RcDom, Handle};
use html5ever::{parse_document, Attribute};
use html5ever::tendril::TendrilSink;
use hyper::Client;
use hyper::header::Connection;
use hyper::header::ConnectionOption;
use hyper::net::HttpsConnector;
use hyper_native_tls::NativeTlsClient;

use Object;
use Image;
use Audio;
use Video;

/// Fetches `url` and extracts its Open Graph object.
///
/// Returns `None` if the request fails or the response status is not a success.
pub fn scrape(url: &str) -> Option<Object> {
    let tls = NativeTlsClient::new().unwrap();
    let connector = HttpsConnector::new(tls);
    let client = Client::with_connector(connector);

    let result = client.get(url)
        .header(Connection(vec![ConnectionOption::Close]))
        .send();
    let mut res = match result {
        Ok(res) => res,
        Err(_) => return None,
    };
    if !res.status.is_success() {
        return None;
    }

    extract(&mut res).map(|mut obj| {
        // Normalize every image URL against the final response URL.
        for image in obj.images.iter_mut() {
            image.normalize(&res.url);
        }
        obj
    })
}

/// Parses an HTML document from `input` and collects its Open Graph
/// properties together with any `<img>`, `<audio>` and `<video>` sources.
pub fn extract<R>(input: &mut R) -> Option<Object> where R: Read {
    let dom = parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .read_from(input)
        .unwrap();

    let mut og_props = Vec::new();
    let mut images = Vec::new();
    let mut audios = Vec::new();
    let mut videos = Vec::new();
    walk(dom.document, &mut og_props, &mut images, &mut audios, &mut videos);

    let mut obj = Object::new(&og_props);
    obj.images.append(&mut images);
    obj.audios.append(&mut audios);
    obj.videos.append(&mut videos);
    Some(obj)
}

/// Walks the DOM depth-first, collecting Open Graph properties and media elements.
fn walk(handle: Handle,
        og_props: &mut Vec<(String, String)>,
        images: &mut Vec<Image>,
        audios: &mut Vec<Audio>,
        videos: &mut Vec<Video>) {
    match handle.data {
        Document => (),
        Doctype { .. } => (),
        Text { .. } => (),
        Comment { .. } => (),
        Element { ref name, ref attrs, .. } => {
            let tag_name = name.local.as_ref();
            match tag_name {
                "meta" => {
                    let mut ps = extract_open_graph_from_meta_tag(&attrs.borrow());
                    og_props.append(&mut ps);
                },
                "img" => {
                    if let Some(image) = extract_image(&attrs.borrow()) {
                        images.push(image);
                    }
                },
                "audio" => {
                    if let Some(audio) = extract_audio(&attrs.borrow()) {
                        audios.push(audio);
                    }
                },
                // The HTML element is `<video>`; matching on "videos" never fired.
                "video" => {
                    if let Some(video) = extract_video(&attrs.borrow()) {
                        videos.push(video);
                    }
                },
                _ => (),
            }
        },
        // The HTML parser is not expected to emit processing instructions.
        ProcessingInstruction { .. } => unreachable!()
    }

    for child in handle.children.borrow().iter() {
        walk(child.clone(), og_props, images, audios, videos)
    }
}

/// Returns the value of the first attribute named `attr_name`, if any.
fn attr(attr_name: &str, attrs: &Vec<Attribute>) -> Option<String> {
    for attr in attrs.iter() {
        if attr.name.local.as_ref() == attr_name {
            return Some(attr.value.to_string())
        }
    }
    None
}

/// Extracts Open Graph `(key, content)` pairs from a `<meta>` tag, looking at
/// both the `property` and the `name` attribute.
pub fn extract_open_graph_from_meta_tag(attrs: &Vec<Attribute>) -> Vec<(String, String)> {
    let mut og_props = Vec::new();
    if let Some(prop) = extract_open_graph_prop("property", attrs) {
        og_props.push(prop);
    }
    if let Some(prop) = extract_open_graph_prop("name", attrs) {
        og_props.push(prop);
    }
    og_props
}

/// If the attribute `attr_name` starts with `og:`, returns the key with the
/// prefix stripped, paired with the tag's `content` attribute.
fn extract_open_graph_prop(attr_name: &str, attrs: &Vec<Attribute>) -> Option<(String, String)> {
    attr(attr_name, attrs)
        .and_then(|property|
                  if property.starts_with("og:") {
                      // "og:" is three ASCII bytes, so slicing at byte 3 is safe.
                      // The previous unchecked slice used the char count as a byte
                      // offset, which is wrong for non-ASCII property names.
                      let key = property[3..].to_string();
                      attr("content", attrs).map(|content| (key, content))
                  } else {
                      None
                  })
}

/// Builds an `Image` from the `src` attribute of an `<img>` tag.
pub fn extract_image(attrs: &Vec<Attribute>) -> Option<Image> {
    attr("src", attrs).map(|src| Image::new(src))
}

/// Builds an `Audio` from the `src` attribute of an `<audio>` tag.
pub fn extract_audio(attrs: &Vec<Attribute>) -> Option<Audio> {
    attr("src", attrs).map(|src| Audio::new(src))
}

/// Builds a `Video` from the `src` attribute of a `<video>` tag.
pub fn extract_video(attrs: &Vec<Attribute>) -> Option<Video> {
    attr("src", attrs).map(|src| Video::new(src))
}

#[cfg(test)]
mod test {
    use super::*;
    use object::ObjectType;

    #[test]
    fn extract_open_graph_object() {
        let x = r#"
<html prefix="og: http://ogp.me/ns#">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="video.movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
</head>
</html>
        "#;
        let obj = extract(&mut x.as_bytes());
        assert!(obj.is_some());
        let obj = obj.unwrap();

        assert_eq!(&obj.title, "The Rock");
        assert_eq!(obj.obj_type, ObjectType::Movie);
        assert_eq!(&obj.url, "http://www.imdb.com/title/tt0117500/");
        assert_eq!(obj.images.len(), 1);
        assert_eq!(&obj.images[0].url, "http://ia.media-imdb.com/images/rock.jpg");
    }
}