===== Handout =====

To start, first download the assignment:
[[http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow.zip|stackoverflow.zip]].
For this assignment, you also need to download the data (170 MB):

[[http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow.csv|http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow.csv]]

and place it in the folder ''src/main/resources/stackoverflow'' in your
project directory.

As in the last exercise, be sure **not** to include the data files in your repository. To do so, add the following line to the ''.gitignore'' file at the root of your repository:

  src/main/resources/

===== StackOverflow =====

The overall goal of this assignment is to implement a distributed k-means algorithm
that clusters posts on the popular question-answer platform StackOverflow
according to their score. Moreover, this clustering should be executed in parallel
for different programming languages, and the results should be compared.

The motivation is as follows: StackOverflow is an important source of documentation.
However, different user-provided answers may have very different ratings (based on
user votes) depending on their perceived value. Therefore, we would like to look at the
distribution of questions and their answers. For example, how many highly rated
answers do StackOverflow users post, and how high are their scores? Are there big
differences between higher-rated answers and lower-rated ones?

Finally, we are interested in comparing these distributions for different
programming language communities. Differences in distributions could reflect
differences in the availability of documentation. For example, StackOverflow could
have better documentation for a certain library than that library's own API
documentation. However, to avoid invalid conclusions we will focus on the
well-defined problem of clustering answers according to their scores.

//Note:// for this assignment, we assume you recall the k-means algorithm introduced earlier in the course. You may refer back to the [[parcon17:project3|third assignment]] for an overview of the algorithm.

==== The Data ====

You are given a CSV (comma-separated values) file with information about
StackOverflow posts. Each line in the provided text file has the following format:

  <postTypeId>,<id>,[<acceptedAnswer>],[<parentId>],<score>,[<tag>]

A short explanation of the comma-separated fields follows.

**<postTypeId>:** Type of the post. Type 1 = question, type 2 = answer.

**<id>:** Unique id of the post (regardless of type).

**<acceptedAnswer>:** Id of the accepted answer post. This information is optional, so it may be missing, indicated by an empty string.

**<parentId>:** For an answer: id of the corresponding question. For a question: missing, indicated by an empty string.

**<score>:** The StackOverflow score (based on user votes).

**<tag>:** The tag indicates the programming language that the post is about, in case it is a question, or is missing in case it is an answer.
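
For reference, here is a minimal sketch of how such a line could be represented and parsed. This is only illustrative: the handout already provides a ''Posting'' class and the ''rawPostings'' method, and the field names below are assumed from the CSV description, so they may differ from the provided code.

  // Illustrative sketch only -- the handout already provides Posting and rawPostings;
  // field names here are assumed from the CSV format described above.
  case class Posting(postTypeId: Int,
                     id: Int,
                     acceptedAnswer: Option[Int],
                     parentId: Option[Int],
                     score: Int,
                     tag: Option[String]) extends Serializable
  // parse one CSV line of the format <postTypeId>,<id>,[<acceptedAnswer>],[<parentId>],<score>,[<tag>]
  def parseLine(line: String): Posting = {
    val arr = line.split(",")
    Posting(
      postTypeId     = arr(0).toInt,
      id             = arr(1).toInt,
      acceptedAnswer = if (arr(2).isEmpty) None else Some(arr(2).toInt),
      parentId       = if (arr(3).isEmpty) None else Some(arr(3).toInt),
      score          = arr(4).toInt,
      tag            = if (arr.length >= 6 && arr(5).nonEmpty) Some(arr(5)) else None
    )
  }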

You will see the following code in the main class:

  val lines   = sc.textFile("src/main/resources/stackoverflow/stackoverflow.csv")
  val raw     = rawPostings(lines)
  val grouped = groupedPostings(raw)
  val scored  = scoredPostings(grouped)
  val vectors = vectorPostings(scored)

It corresponds to the following steps:

  - ''lines'': the lines from the CSV file as strings
  - ''raw'': the raw ''Posting'' entries for each line
  - ''grouped'': questions and answers grouped together
  - ''scored'': questions and scores
  - ''vectors'': pairs of (language, score) for each question

The first two methods are given to you. You will have to implement the rest.

==== Data processing ====

We will now look at how to process the data before applying the k-means algorithm.

=== Grouping questions and answers ===

The first method you will have to implement is ''groupedPostings'':

  val grouped = groupedPostings(raw)

In the ''raw'' variable we have simple postings, either questions or
answers, but in order to use the data we need to assemble them together. Questions
are identified using ''postTypeId == 1''. Answers to a question with
''id == QID'' have (a) ''postTypeId == 2'' and (b)
''parentId == QID''.

Ideally, we want to obtain an RDD of pairs
''(Question, Iterable[Answer])''. However, grouping on the question
directly is expensive (can you imagine why?), so a better alternative is to match
on the QID, thus producing an ''RDD[(QID, Iterable[(Question, Answer)])]''.

To obtain this, in the ''groupedPostings'' method, first filter the
questions and answers separately and then prepare them for a join operation by
extracting the QID value into the first element of a tuple. Then, use one of the
''join'' operations (which one?) to obtain an
''RDD[(QID, (Question, Answer))]''. The last step is to obtain an
''RDD[(QID, Iterable[(Question, Answer)])]''. How can you do that? What
method do you use to group by the key of a pair RDD?

Finally, in the description we made QID, Question and Answer separate types,
but in the implementation QID is an ''Int'' and both questions and answers
are of type ''Posting''. Therefore, the signature of
''groupedPostings'' is:

  def groupedPostings(postings: RDD[/* Question or Answer */ Posting]):
      RDD[(/*QID*/ Int, Iterable[(/*Question*/ Posting, /*Answer*/ Posting)])]

This should allow you to implement the ''groupedPostings'' method.
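
Putting these steps together, one possible shape of the solution looks like the sketch below. It is only a sketch: it assumes the ''Posting'' field names ''postTypeId'' and ''parentId'' from the CSV description above, and that the usual Spark imports from the handout are in scope.

  // Sketch only: filter questions and answers, key both by QID, join, then group.
  def groupedPostings(postings: RDD[Posting]): RDD[(Int, Iterable[(Posting, Posting)])] = {
    // questions keyed by their own id (the QID)
    val questions = postings.filter(_.postTypeId == 1).map(q => (q.id, q))
    // answers keyed by the id of the question they answer
    val answers = postings.filter(_.postTypeId == 2).flatMap(a => a.parentId.map(qid => (qid, a)))
    // an inner join keeps only questions that have at least one answer;
    // groupByKey then collects all (question, answer) pairs sharing the same QID
    questions.join(answers).groupByKey()
  }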

=== Computing Scores ===

Second, implement the ''scoredPostings'' method, which should return an
RDD containing pairs of (a) questions and (b) the score of the answer with the
highest score (note: this does **not** have to be the answer marked as
''acceptedAnswer''!). The type of this scored RDD is:

  val scored: RDD[(Posting, Int)] = ???

For example, the ''scored'' RDD should contain the following tuples:

  ((1,6,None,None,140,Some(CSS)),67)
  ((1,42,None,None,155,Some(PHP)),89)
  ((1,72,None,None,16,Some(Ruby)),3)
  ((1,126,None,None,33,Some(Java)),30)
  ((1,174,None,None,38,Some(C#)),20)

//Hint:// use the ''answerHighScore'' helper already provided inside ''scoredPostings''.
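
A possible shape of the solution, as a sketch only (it assumes ''answerHighScore'' takes an array of answer ''Posting''s and returns the highest score):

  // Sketch only: for each QID group, keep the question and the best answer score.
  def scoredPostings(grouped: RDD[(Int, Iterable[(Posting, Posting)])]): RDD[(Posting, Int)] =
    grouped.values.map { questionAnswers =>
      val question = questionAnswers.head._1            // the same question in every pair
      val answers  = questionAnswers.map(_._2).toArray  // all answers to that question
      (question, answerHighScore(answers))
    }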

=== Creating vectors for clustering ===

Next, we prepare the input for the clustering algorithm. For this, we transform
the ''scored'' RDD into a ''vectors'' RDD containing the vectors
to be clustered. In our case, the vectors should be pairs with two components
(in the listed order!):

  - Index of the language (in the ''langs'' list) multiplied by the ''langSpread'' factor.
  - The highest answer score (computed above).

The ''langSpread'' factor is provided (set to 50000). Basically, it makes
sure posts about different programming languages are at least 50000 apart
according to the distance measure provided by the ''euclideanDist'' function.
You will learn later what this distance means and why it is set to this value.

The type of the ''vectors'' RDD is as follows:

  val vectors: RDD[(Int, Int)] = ???

For example, the ''vectors'' RDD should contain the following tuples:

  (350000,67)
  (100000,89)
  (300000,3)
  (50000,30)
  (200000,20)

Implement this functionality in the ''vectorPostings'' method, using the
given ''firstLangInTag'' helper method.

//(Idea for a test: the ''scored'' RDD should have 2121822 entries.)//
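
A sketch of one way to write it (assuming ''firstLangInTag'' returns an ''Option[Int]'' with the index of the question's language in ''langs'', and that the question's tag field is an ''Option[String]''):

  // Sketch only: turn every scored question into a (langIndex * langSpread, score) pair.
  def vectorPostings(scored: RDD[(Posting, Int)]): RDD[(Int, Int)] =
    scored.flatMap { case (question, highScore) =>
      firstLangInTag(question.tag, langs).map(langIndex => (langIndex * langSpread, highScore))
    }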

==== K-means Clustering ====

The ''main'' method then computes the cluster means:

  val means = kmeans(sampleVectors(vectors), vectors)

Based on these initial means, and on the provided ''converged'' method,
implement the k-means algorithm by iteratively:

  - pairing each vector with the index of the closest mean (its cluster);
  - computing the new means by averaging the values of each cluster.

To implement these iterative steps, use the provided functions
''findClosest'', ''averageVectors'', and ''euclideanDistance''.
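
As an illustration only, the iteration could be structured roughly as follows. The calls to ''findClosest'', ''averageVectors'', ''euclideanDistance'' and ''converged'' stand for the helpers provided in the handout (their exact signatures may differ), and ''kmeansMaxIterations'' is a hypothetical iteration cap, not necessarily the name used in the provided code.

  // Rough sketch, not the reference solution.
  def kmeans(means: Array[(Int, Int)], vectors: RDD[(Int, Int)], iter: Int = 1): Array[(Int, Int)] = {
    val newMeans = means.clone()
    vectors
      .map(v => (findClosest(v, means), v))  // step 1: index of the closest mean
      .groupByKey()
      .mapValues(averageVectors)             // step 2: average the vectors of each cluster
      .collect()
      .foreach { case (idx, mean) => newMeans(idx) = mean }
    val distance = euclideanDistance(means, newMeans)   // total movement of the means
    if (converged(distance) || iter >= kmeansMaxIterations) newMeans
    else kmeans(newMeans, vectors, iter + 1)
  }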

**Note 1:**

In our tests, convergence is reached after 44 iterations (for langSpread=50000)
and after 104 iterations (for langSpread=1), and for the first iterations the
distance keeps growing. Although it may look like something is wrong, this is the
expected behavior. Having many remote points forces the kernels to shift quite a
bit and with each shift the effects ripple to other kernels, which also move around,
and so on. Be patient: in 44 iterations the distance will drop from over 100000 to
13, satisfying the convergence condition.

If you want to get the results faster, feel free to downsample the data (each
iteration is faster, but it still takes around 40 steps to converge):

  val scored  = scoredPostings(grouped).sample(true, 0.1, 0)

However, keep in mind that we will test your assignment on the full data set. That
means you can downsample for experimentation, but make sure your algorithm
works on the full data set when you submit for grading.

**Note 2:**

The variable ''langSpread'' corresponds to how far away languages are from each other,
from the clustering algorithm's point of view. For a value of 50000, the languages are
too far away to be clustered together at all, resulting in a clustering that only
takes scores into account for each language (similar to partitioning the data
across languages and then clustering based on the score). A more interesting (but
less scientific) clustering occurs when ''langSpread'' is set to 1 (we can't
set it to 0, as that would lose the language information completely), where we cluster
according to the score. See which language dominates the top questions now.

=== Computing Cluster Details ===

After the call to ''kmeans'', we have the following code in method ''main'':

  val results = clusterResults(means, vectors)
  printResults(results)

Implement the ''clusterResults'' method, which, for each cluster, computes:

  - the dominant programming language in the cluster;
  - the percent of answers that belong to the dominant language;
  - the size of the cluster (the number of questions it contains);
  - the median of the highest answer scores.

Once these values are returned, they are printed on the screen by the ''printResults''
method.
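
As a sketch of one possible approach (it assumes each result is a tuple of language label, percentage, cluster size and median score, and reuses the provided ''langs'', ''langSpread'' and ''findClosest''; adapt it to whatever return type the handout code expects):

  // Sketch only; langs, langSpread and findClosest are the provided helpers.
  def clusterResults(means: Array[(Int, Int)], vectors: RDD[(Int, Int)]): Array[(String, Double, Int, Int)] = {
    val closestGrouped = vectors.map(v => (findClosest(v, means), v)).groupByKey()
    val results = closestGrouped.mapValues { vs =>
      // recover the language index from the first vector component
      val langCounts = vs.groupBy(_._1 / langSpread).map { case (idx, ps) => (idx, ps.size) }
      val (langIndex, langCount) = langCounts.maxBy(_._2)
      val clusterSize = vs.size
      val langPercent = langCount * 100.0 / clusterSize
      // median of the highest answer scores in this cluster
      val scores = vs.map(_._2).toArray.sorted
      val medianScore =
        if (clusterSize % 2 == 1) scores(clusterSize / 2)
        else (scores(clusterSize / 2 - 1) + scores(clusterSize / 2)) / 2
      (langs(langIndex), langPercent, clusterSize, medianScore)
    }
    results.values.collect()
  }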

==== Questions ====

  - Do you think that partitioning your data would help?
  - Have you thought about persisting some of your data? Can you think of why persisting your data in memory may be helpful for this algorithm?
  - Of the non-empty clusters, how many clusters have "Java" as their label (based on the majority of questions, see above)? Why?
  - Only considering the "Java clusters", which clusters stand out and why?
  - How are the "C# clusters" different compared to the "Java clusters"?
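
For question 2 in particular, note that the ''vectors'' RDD is traversed once per k-means iteration, so keeping it in memory avoids recomputing the whole preprocessing pipeline every time. A hedged example of how this could look in the main class:

  // Possible optimization sketch: cache the vectors before the iterative kmeans call.
  val vectors = vectorPostings(scored).persist()
  val means   = kmeans(sampleVectors(vectors), vectors)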