Implement the baseline Lucene system for the News data (search engine with the Lucene library)

Published six years ago

Views: 1009

Project code: 90737

Project description

Implement the baseline Lucene system for the News data:
This website may help you: http://www.lucenetutorial.com/index.html

A. Preparation

Download the current Lucene release from https://lucene.apache.org/core/downloads.html
Test your distribution by following the instructions in "Indexing Files" at https://lucene.apache.org/core/5_4_0/demo/overview-summary.html
Read and understand (well enough to modify) the demo source code, mainly:

IndexFiles.java (https://lucene.apache.org/core/5_4_0/demo/src-html/org/apache/lucene/demo/IndexFiles.html): code to create a Lucene index.
SearchFiles.java (https://lucene.apache.org/core/5_4_0/demo/src-html/org/apache/lucene/demo/SearchFiles.html): code to search a Lucene index

Now you are ready to write your own indexer for the news data:

You may start with the demo IndexFiles.java code
However, you will need to write your own document parser to process the files (to detect individual document boundaries within each file).
Warning: Do not copy the input documents. Instead, give the original location of the text directory as input to the indexer and store the index in your own project directory.
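
The boundary detection mentioned above can be sketched in plain Java. This is a minimal sketch, assuming the input is the cran.all file listed under Documents, where each record starts with an ".I <id>" line followed by ".T" (title), ".A" (author), ".B" (bibliography) and ".W" (body text) sections. The class names CranDoc and CranParser are hypothetical and are not part of Lucene or of the demo code.

```java
import java.util.ArrayList;
import java.util.List;

/** One parsed record: the id from the ".I" line plus title and body text. */
final class CranDoc {
    final String id;
    final String title;
    final String body;

    CranDoc(String id, String title, String body) {
        this.id = id;
        this.title = title;
        this.body = body;
    }
}

/** Hypothetical helper: splits cran.all-style lines into individual documents. */
final class CranParser {
    static List<CranDoc> parse(List<String> lines) {
        List<CranDoc> docs = new ArrayList<>();
        String id = null;          // id of the record being read, null before the first ".I"
        char section = ' ';        // current section letter: T, A, B or W
        StringBuilder title = new StringBuilder();
        StringBuilder body = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith(".I ")) {
                // A new ".I" marker closes the previous record, if any.
                if (id != null) {
                    docs.add(new CranDoc(id, title.toString().trim(), body.toString().trim()));
                }
                id = line.substring(3).trim();
                title.setLength(0);
                body.setLength(0);
                section = ' ';
            } else if (line.length() == 2 && line.charAt(0) == '.'
                    && "TABW".indexOf(line.charAt(1)) >= 0) {
                section = line.charAt(1);   // switch section, keep no text from the marker line
            } else if (section == 'T') {
                title.append(line).append(' ');
            } else if (section == 'W') {
                body.append(line).append(' ');
            }
            // Author (.A) and bibliography (.B) text is skipped in this sketch.
        }
        if (id != null) {
            docs.add(new CranDoc(id, title.toString().trim(), body.toString().trim()));
        }
        return docs;
    }
}
```

The lines could come from java.nio.file.Files.readAllLines on the original data location, and each CranDoc would then be turned into one Lucene document by the indexer.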

Make sure you index both the filename and the document id (look at one of the documents in the collection to see the difference).
Pay attention to the tokenizer (analyzer in Lucene lingo) used. You will have to use the same one at search time.
Test your index with the modified SearchFiles.java program (hint: just make sure you can tell it where your index is, the rest can stay the same).
B. Ranking
B.1. Implement the TF*IDF ranking:
Use Lucene's own ranking algorithm to find the top 10 documents.
Implement TF-IDF ranking in Lucene and find the top 10 documents.
Implement TF-IDF with different formulas for the query and the documents.
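
To make "different formulas for the query and the documents" concrete, here is a minimal in-memory sketch of one possible choice, the lnc.ltc scheme from SMART notation: documents use log-scaled term frequency with cosine normalisation and no IDF, while queries use log-scaled term frequency multiplied by IDF, also cosine-normalised. It runs on plain token lists rather than on a Lucene index, and the class name TfIdfRanker is hypothetical. Treat it as an illustration of the arithmetic, not as the required implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory lnc.ltc scorer over pre-tokenised documents. */
final class TfIdfRanker {
    private final List<List<String>> docs;
    private final Map<String, Integer> df = new HashMap<>();   // document frequency per term

    TfIdfRanker(List<List<String>> docs) {
        this.docs = docs;
        for (List<String> d : docs) {
            for (String t : new HashSet<>(d)) {
                df.merge(t, 1, Integer::sum);
            }
        }
    }

    private static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> c = new HashMap<>();
        for (String t : tokens) {
            c.merge(t, 1, Integer::sum);
        }
        return c;
    }

    /** Divides every weight by the vector's L2 norm (no-op for an all-zero vector). */
    private static void normalise(Map<String, Double> w) {
        double sum = 0;
        for (double v : w.values()) {
            sum += v * v;
        }
        double norm = Math.sqrt(sum);
        if (norm > 0) {
            w.replaceAll((k, v) -> v / norm);
        }
    }

    /** Document side ("lnc"): 1 + ln(tf), no IDF, cosine-normalised. */
    Map<String, Double> docWeights(int docIndex) {
        Map<String, Double> w = new HashMap<>();
        counts(docs.get(docIndex)).forEach((t, tf) -> w.put(t, 1 + Math.log(tf)));
        normalise(w);
        return w;
    }

    /** Query side ("ltc"): (1 + ln(tf)) * ln(N / df), cosine-normalised. */
    Map<String, Double> queryWeights(List<String> query) {
        Map<String, Double> w = new HashMap<>();
        counts(query).forEach((t, tf) -> {
            Integer n = df.get(t);
            if (n != null) {   // terms unseen in the collection contribute nothing
                w.put(t, (1 + Math.log(tf)) * Math.log((double) docs.size() / n));
            }
        });
        normalise(w);
        return w;
    }

    /** Dot product of the query vector and the document vector. */
    double score(List<String> query, int docIndex) {
        Map<String, Double> q = queryWeights(query);
        Map<String, Double> d = docWeights(docIndex);
        double s = 0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            s += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        return s;
    }
}
```

Scoring every document for a query and sorting by score in descending order gives the top 10 list. In Lucene itself the same idea would be expressed through a custom similarity rather than hand-rolled maps.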

B.2. Variations of TF*IDF

1. Think of two more variations of TF*IDF, then implement and test them.
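
As a starting point for choosing variations, the sketch below collects a few standard textbook weighting components: log-scaled TF, augmented TF, a BM25-style saturating TF, and a smoothed IDF. They are offered only as candidates to experiment with. The class name TfVariants and the choice of which components to combine are assumptions, not something the assignment prescribes.

```java
/** Hypothetical collection of common TF and IDF weighting components. */
final class TfVariants {
    /** Raw term frequency. */
    static double raw(int tf) {
        return tf;
    }

    /** Sublinear scaling: 1 + ln(tf), or 0 when the term is absent. */
    static double logScaled(int tf) {
        return tf > 0 ? 1 + Math.log(tf) : 0;
    }

    /** Augmented TF: 0.5 + 0.5 * tf / maxTf, which damps the effect of document length. */
    static double augmented(int tf, int maxTf) {
        return tf > 0 ? 0.5 + 0.5 * tf / maxTf : 0;
    }

    /** BM25-style saturation without length normalisation: approaches k1 + 1 as tf grows. */
    static double saturated(int tf, double k1) {
        return tf * (k1 + 1) / (tf + k1);
    }

    /** Smoothed IDF: ln(1 + N / df), which stays positive even for very common terms. */
    static double smoothedIdf(int numDocs, int df) {
        return Math.log(1 + (double) numDocs / df);
    }
}
```

A variant is then any product of one TF component and one IDF component, for example saturated(tf, 1.2) * smoothedIdf(N, df), where 1.2 is only a commonly used default for k1.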

C. Evaluation
trec_eval
1) For the qrels provided in VU:
a) Convert them to a format that trec_eval can read.
2) Run the queries over the dataset.
3) Output: for each topic, return up to 10 documents, one result per line, tab-delimited, in the following format:
topic_id \t Q0 \t document_id \t rank \t score \t your_login
e.g.,
201 \t Q0 \t FBIS-41571 \t 143 \t 101.24 \t eugene
4) Use trec_eval to compute measures such as MAP and P@5, along with the other measures.
5) Compare your algorithms.
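
The output format in step 3 can be produced with a small helper such as the one below. This is a sketch only: the class name TrecRun, the four-decimal score formatting, and the parallel-list inputs are assumptions made for illustration. The field order follows the format line given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical formatter for tab-delimited trec_eval run lines. */
final class TrecRun {
    /** Builds one line: topic_id, Q0, document_id, rank, score, run tag. */
    static String line(String topicId, String docId, int rank, double score, String runTag) {
        // Locale.ROOT keeps the decimal separator a dot regardless of system locale.
        return String.format(Locale.ROOT, "%s\tQ0\t%s\t%d\t%.4f\t%s",
                topicId, docId, rank, score, runTag);
    }

    /** Formats the first k entries of an already-sorted result list, with ranks starting at 1. */
    static List<String> topK(String topicId, List<String> rankedDocIds,
                             List<Double> scores, int k, String runTag) {
        List<String> out = new ArrayList<>();
        int n = Math.min(k, rankedDocIds.size());
        for (int i = 0; i < n; i++) {
            out.add(line(topicId, rankedDocIds.get(i), i + 1, scores.get(i), runTag));
        }
        return out;
    }
}
```

Calling topK with k = 10 for every topic and writing the resulting lines to one file gives a run file that can be passed to trec_eval together with the qrels file.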

Documents: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
cran.all - The documents
cran.qry - The queries
cranqrel - The relevance assessments
readme - Some attempt at explanation especially about the relevance judgements
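
Converting cranqrel into the four-column qrels layout that trec_eval reads (topic, iteration, document id, relevance) might look like the sketch below. It assumes each cranqrel line holds a query id, a document id, and a relevance code separated by whitespace, and it treats codes 1 to 4 as relevant and anything else as not relevant. That mapping is an assumption that should be checked against the readme before use. The class name QrelConverter is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical converter from cranqrel lines to trec_eval qrels lines. */
final class QrelConverter {
    static List<String> toTrecQrels(List<String> cranqrelLines) {
        List<String> out = new ArrayList<>();
        for (String line : cranqrelLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 3) {
                continue;   // skip blank or malformed lines
            }
            int code = Integer.parseInt(parts[2]);
            // ASSUMPTION: codes 1..4 mean some degree of relevance; verify with the readme.
            int rel = (code >= 1 && code <= 4) ? 1 : 0;
            // trec_eval qrels layout: topic, iteration (unused, 0), document id, relevance
            out.add(parts[0] + " 0 " + parts[1] + " " + rel);
        }
        return out;
    }
}
```

The document ids written here must match the ids stored in the index and printed in the run file, otherwise trec_eval will count every retrieved document as unjudged.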

D. Report
1. Prepare a report of at least 10 pages (single-spaced, 12-point font) that includes the following parts:
a. Introduction
  i. A summary of what you have done and your findings
b. Indexing
  i. Create four indexes: with stop words, without stop words, with stemming, and without stemming. Explain the code and the differences in index size, memory use, and speed.

c. Ranking Method
  i. An explanation of the implementation of the original TF-IDF, TF-IDF with different settings for query and documents, and your TWO new versions of TF-IDF
    1. You need to clearly justify your ideas (you may use references to academic papers).
d. Evaluation
  i. Tables and diagrams showing MAP, P@5, P@10, and NDCG for all the ranking methods
  ii. Explain why your methods improve or worsen the original methods. Your results must be statistically significant (http://www.statisticssolutions.com/manova-analysis-paired-sample-t-test/, https://www.graphpad.com/quickcalcs/ttest1.cfm)
e. Conclusion
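
For the significance requirement in the Evaluation part, the paired-sample t statistic over per-topic scores (for example, the average precision of each topic under two ranking methods) can be computed as in the sketch below. It returns only the t value with n - 1 degrees of freedom. The p-value still has to be looked up in a t table or obtained from a calculator such as the one linked above. The class name PairedTTest is hypothetical.

```java
/** Hypothetical helper computing the paired-sample t statistic. */
final class PairedTTest {
    /**
     * a[i] and b[i] are the scores of the same topic under two methods.
     * Returns mean(d) / (sd(d) / sqrt(n)) where d[i] = a[i] - b[i].
     * If all differences are identical, sd is 0 and the result is infinite or NaN.
     */
    static double tStatistic(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("need two equally sized samples with n >= 2");
        }
        int n = a.length;
        double mean = 0;
        for (int i = 0; i < n; i++) {
            mean += a[i] - b[i];
        }
        mean /= n;
        double sumSquares = 0;
        for (int i = 0; i < n; i++) {
            double dev = (a[i] - b[i]) - mean;
            sumSquares += dev * dev;
        }
        double sd = Math.sqrt(sumSquares / (n - 1));   // sample standard deviation
        return mean / (sd / Math.sqrt(n));
    }
}
```

The per-topic values can be taken from trec_eval's per-query output, so that both arrays list the topics in the same order.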

This project includes 1 important file; please be sure to review this file before submitting a bid.

Required skills and expertise

.NET, C# Programming, Java, PHP

Deadline

3 days

Bidding status

Closed

About the employer

User103464

Joined six years ago

3 projects posted,

0 projects in progress,

0 projects open for bids,

Bid acceptance rate: 0%
