pablo-recabal
(artists,dates)
pipeline
Data Store & FrontEnd
Storage & Batch processing
data sources
MusicBrainz
(1.4 TB, 240 million records)
clusters
HDFS datanode / Spark executor
HDFS datanode / Spark executor
HDFS datanode / Spark executor
HDFS namenode / Spark driver
Flask server
OrientDB master
OrientDB master
4 x m4.large (8 GB RAM each & 6 TB SSD total)
3 x m4.large (32 GB SSD total)
data flow
content
header
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://www.biography.com/people/ella-fitzgerald-9296210
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID:
WARC-Refers-To:
WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
Content-Type: text/plain
Content-Length: 6724
Ella Fitzgerald, known as the "First Lady of Song" and "Lady Ella," was an American jazz and song vocalist who interpreted much of the Great American Songbook...
www.biography.com/people/ella-fitzgerald-9296210, Ella Fitzgerald
www.oldies.com/product-view/47037M.html, Louis Armstrong
bojack.org/2007/06/knock_a_few_bucks_off.html, John Coltrane
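The step from a WET record to (website, artist) pairs can be sketched as follows. This is only a minimal illustration, not the pipeline's actual code: the function names, the three-name `ARTISTS` list, and the whole-name regular-expression match are assumptions made for the example, and a real record uses CRLF line endings and carries more header fields.

```python
import re

# Hypothetical catalog; the slides use a 1,000-entry jazz subset of MusicBrainz.
ARTISTS = ["Ella Fitzgerald", "Louis Armstrong", "John Coltrane"]


def parse_wet_record(record):
    """Split one WET record into (header dict, content text)."""
    record = record.replace("\r\n", "\n")  # tolerate CRLF line endings
    head, _, content = record.partition("\n\n")
    headers = {}
    for line in head.splitlines()[1:]:  # skip the "WARC/1.0" version line
        key, _, value = line.partition(":")  # split at the first colon only
        headers[key.strip()] = value.strip()
    return headers, content


def artist_mentions(record, artists=ARTISTS):
    """Return (url, artist) pairs for every catalog name found in the content."""
    headers, content = parse_wet_record(record)
    url = re.sub(r"^https?://", "", headers.get("WARC-Target-URI", ""))
    return [
        (url, name)
        for name in artists
        if re.search(r"\b" + re.escape(name) + r"\b", content)
    ]


sample = (
    "WARC/1.0\n"
    "WARC-Type: conversion\n"
    "WARC-Target-URI: http://www.biography.com/people/ella-fitzgerald-9296210\n"
    "Content-Type: text/plain\n"
    "\n"
    'Ella Fitzgerald, known as the "First Lady of Song" and "Lady Ella," ...'
)
print(artist_mentions(sample))
```

Matching whole names on word boundaries keeps the sketch short, but it does nothing for ambiguous names, which need extra context to tell a band from an ordinary word.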
challenges
- How to find the bands: Air, The Clash, Chicago?
~1.4 TB, 274M websites, 1,000 artists
- Norah Jones vs Miles Davis?
Artist catalog:
- MusicBrainz database (~1,000,000 entries)
→ Jazz subset (1,000 entries)
Artist relationship metric:
- CommonCrawl July 2015 log (~145 TB)
→ Uncompressed '.wet' files (~1.5 TB)
data specs
[Graph: the artists John Coltrane, Norah Jones, and Miles Davis, each linked to the websites (W1 to W13) that mention them]
        Miles  John  Norah  Total
Miles     5      2     2      9
John      2      4     1      7
Norah     2      1     9     12
model
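One way such a link table could be built from (website, artist) pairs is sketched below. It rests on an assumption the slides do not state: the diagonal counts websites that mention only that artist, and the off-diagonal cells count websites that mention both. The website-to-artist pairs in the example are hypothetical, not the edges of the graph.

```python
from collections import defaultdict
from itertools import combinations


def link_table(pairs):
    """Build table[a][b] from an iterable of (website, artist) pairs."""
    artists_by_site = defaultdict(set)
    for site, artist in pairs:
        artists_by_site[site].add(artist)

    table = defaultdict(lambda: defaultdict(int))
    for artists in artists_by_site.values():
        if len(artists) == 1:
            # Assumption: a site mentioning a single artist goes on the diagonal.
            (only,) = artists
            table[only][only] += 1
        else:
            # A site shared by several artists adds one link to each pair.
            for a, b in combinations(sorted(artists), 2):
                table[a][b] += 1
                table[b][a] += 1
    return table


def totals(table):
    """Row sums, corresponding to the Total column."""
    return {artist: sum(row.values()) for artist, row in table.items()}


# Hypothetical pairs, for illustration only.
pairs = [
    ("W1", "John"),
    ("W5", "John"),
    ("W5", "Miles"),
    ("W9", "Miles"),
    ("W2", "Norah"),
]
table = link_table(pairs)
print(totals(table))
```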
Average links between any two artists “X” = (2+2+1)/3 = 1.667
Average links for a single artist “Y” = (9+7+12)/3 = 9.333
=> Average percentage “Z” = X/Y ≈ 17.9 %
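The arithmetic above can be checked against the link table with a few lines of Python. This is a sketch only: the nested-list layout and the variable names are choices made for the example.

```python
from itertools import combinations

artists = ["Miles", "John", "Norah"]
links = [
    [5, 2, 2],  # Miles
    [2, 4, 1],  # John
    [2, 1, 9],  # Norah
]

# X: mean of the off-diagonal cells, one per unordered pair of artists.
pair_counts = [links[i][j] for i, j in combinations(range(len(artists)), 2)]
x = sum(pair_counts) / len(pair_counts)

# Y: mean of the row totals.
row_totals = [sum(row) for row in links]
y = sum(row_totals) / len(row_totals)

z = x / y  # exactly 5/28
print(f"X = {x:.3f}, Y = {y:.3f}, Z = {z:.1%}")
```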
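bool areConnected(artist A, artist B) {
    // shared links as a fraction of each artist's own links
    aCountsInB = countLinks(A, B) / countLinks(B)
    bCountsInA = countLinks(A, B) / countLinks(A)
    // connected if the mean fraction exceeds C times the average percentage Z
    if mean(aCountsInB, bCountsInA) > C * Z
        return true
    return false
}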
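A runnable Python version of the pseudocode, applied to the three-artist link table, might look like the sketch below. Two points are assumptions rather than something the slides state: `count_links` with a single artist returns that artist's row total, and the constant C defaults to 1.0.

```python
from statistics import mean

LINKS = {
    "Miles": {"Miles": 5, "John": 2, "Norah": 2},
    "John": {"Miles": 2, "John": 4, "Norah": 1},
    "Norah": {"Miles": 2, "John": 1, "Norah": 9},
}

# Average percentage Z = X / Y, using the values from the model.
Z = ((2 + 2 + 1) / 3) / ((9 + 7 + 12) / 3)


def count_links(a, b=None):
    """Links shared by a and b, or (assumed) the row total of a when b is omitted."""
    if b is None:
        return sum(LINKS[a].values())
    return LINKS[a][b]


def are_connected(a, b, z=Z, c=1.0):
    """True if the mean shared-link fraction of a and b exceeds c * z."""
    a_counts_in_b = count_links(a, b) / count_links(b)
    b_counts_in_a = count_links(a, b) / count_links(a)
    return mean([a_counts_in_b, b_counts_in_a]) > c * z


for a, b in [("Miles", "John"), ("Miles", "Norah"), ("John", "Norah")]:
    print(a, b, are_connected(a, b))
```

With C = 1.0, Miles and John come out connected (mean fraction about 0.25) and John and Norah do not (about 0.11). Raising C tightens the threshold. An artist with no links at all would cause a division by zero, so a production version would need to guard against that case.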