Use Python to do clustering and topic modeling

In this assignment, you’ll need to use the following datasets:

- text_train.json: This file contains a list of documents. It’s used for training models.
- text_test.json: This file contains a list of documents and the labels of each document. It’s used for testing performance. This file is in the format shown below. Note that each document has a list of labels.

Text                                        Labels
faa issues fire warning for lithium …       [T1, T3]
rescuers pull from flooded coal mine …      [T1]
…                                           …

Q1: K-Means Clustering

Define a function cluster_kmean() as follows:

- Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json.
- Uses KMeans to cluster all documents in both train_file and test_file into 3 clusters by cosine similarity. Note: combine the documents in these two files and train a single clustering model from the combined documents.
- Tests the clustering model performance using test_file:
  - Use only the first label in the label list of each test document as the ground_truth label, e.g. the first document in the table above will have the ground_truth label “T1”.
  - Apply the majority vote rule to map the clusters to the labels in test_file, i.e., T1, T2, T3.
  - Calculate precision/recall/f-score for each label.
- Check centroids/samples in each cluster to interpret it, and give it a meaningful name (instead of T1, T2, T3).
- This function has no return value. Print out precision/recall/f-score. Write down the meaningful cluster names in a document. Also include in that document one sample document from train_file for each cluster.

Q2: LDA Clustering

Define a function cluster_lda() as follows:

- Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json.
- Uses LDA to train a topic model with only the documents in train_file and the number of topics K = 3.
- Predicts the topic distribution of each document in test_file, and selects only the top one topic (i.e. the topic with the highest probability).
- Evaluates the topic model performance using the topic predictions for the documents in test_file:
  - Use the first label in the label list of each test document as the ground_truth label, e.g. the first document in the table above will have the ground_truth label “T1”.
  - Apply the majority vote rule to map the topics to the labels in test_file, i.e., T1, T2, T3.
  - Calculate precision/recall/f-score for each label.
- Based on the word distribution of each topic, give the topic a meaningful name (instead of T1, T2, T3).
- This function has no return value. Print out precision/recall/f-score. Also, provide a document which contains:
  - the meaningful topic names
  - one sample document from train_file for each topic
  - a performance comparison between Q1 and Q2
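
The evaluation step shared by Q1 and Q2 (take the first label as ground truth, map each cluster or topic to a label by majority vote, then score each label) could be sketched roughly as below. This is only an illustration in plain Python, not the expected solution: the helper names map_clusters_by_majority and per_label_scores, the toy label lists, and the cluster ids are all made up for the example. In a real run the ids would come from the KMeans or LDA model. Note that with a majority vote two clusters can end up mapped to the same label, and ties are broken here by whichever label was seen first.

```python
from collections import Counter, defaultdict


def map_clusters_by_majority(cluster_ids, true_labels):
    """Map each cluster id to the ground-truth label that occurs most often in it."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    # most_common(1) breaks ties by first-seen order (an arbitrary choice here).
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def per_label_scores(predicted, true_labels):
    """Return {label: (precision, recall, f_score)} for every ground-truth label."""
    scores = {}
    pairs = list(zip(predicted, true_labels))
    for label in sorted(set(true_labels)):
        tp = sum(p == label and t == label for p, t in pairs)
        fp = sum(p == label and t != label for p, t in pairs)
        fn = sum(p != label and t == label for p, t in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f_score)
    return scores


# Hypothetical toy data: label lists shaped like those in text_test.json,
# and made-up cluster ids standing in for a model's output.
test_label_lists = [["T1", "T3"], ["T1"], ["T2"], ["T2", "T1"], ["T3"], ["T3"]]
cluster_ids = [0, 0, 1, 1, 2, 0]

ground_truth = [labels[0] for labels in test_label_lists]  # first label only
mapping = map_clusters_by_majority(cluster_ids, ground_truth)
predicted = [mapping[c] for c in cluster_ids]

for label, (p, r, f) in per_label_scores(predicted, ground_truth).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```

The same two helpers would apply unchanged to Q2 if the cluster ids are replaced by the index of the highest-probability topic for each test document.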