An Approach to Build an Online News Distribution System
News has always been a significant part of our society. In the past, we depended mostly on news channels and newspapers to get our updates. In today's fast-paced world, news media and agencies have started using the internet to reach readers, and the move has proven very helpful, allowing media houses to greatly extend their reach.
In the present world there are numerous media outlets, so it is safe to say that, given busy schedules, no one can gather news from all of them. Besides, each outlet covers each story differently, and some readers like to compare coverage, reading the same story from multiple houses to get the full picture of an event. All these requirements are addressed by a type of application that is currently gaining popularity: online news distribution applications. These applications gather news from multiple sources and serve it to the user as a single feed. In this article, we will look at an approach toward building such an application.
The main component of such an application is, of course, the news. I have used four of the most popular media houses in India as the sources for the application. Each media house has its own website, from which we scrape the headline links and the stories. We then use extractive text summarization to condense each story into a gist of 3 to 5 sentences. We store the collected information, along with the source (the name of the publishing media house), date, time, and title of the story, in datewise files. Each datewise file provides the feed for that particular date.
We can also extract another piece of information from the story title: the subject of the story. Each title carries some relevant information; it may be the name of a person, a country, an organization, or any important topic of the time, for instance, COVID-19. These names or topics are usually the subjects of the story. We will extract these words of interest from the title, use them as labels or tags for the corresponding stories, and store the labels alongside the titles in the files.
An app can be used by many users of different types, so we must create a filtering or recommender mechanism to customize a user's feed according to his/her interests. For this, we need a login system, so that we can separately record the type of stories each user reads and recommend stories based on his/her own account. We will maintain a database containing the user's name, email, phone number (optional), and password. The email will serve as our unique key here.
We will also maintain two JSON files. The first records the stories each user reads and the corresponding labels; here we use the user's email as the key, and the labels keep telling us which topics the user is interested in. The second file records the users who read each story. In this file, we form a unique key in the format:
Publishing House + $ + Publishing Date + $ + Story Title
This unique key will be used as the key in our JSON file, and each key will map to the emails of the users who read the story. The idea behind this is that the labels attached to each email in the user's file allow us to do content-based recommendations, and if we use both files together, we can build a full user-item interaction matrix, which can be used for collaborative-filtering-based recommendations.
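To make the two files concrete, here is a purely illustrative sketch of what they might contain (the field names follow the description above and the screenshots shown later; the specific values are hypothetical):

# user_records.json: keyed by user email, storing the stories read and their labels
user_records = {
    "XYZ@gmail.com": {
        "news": ["News18$2020-08-30$Some story title"],
        "labels": ["COVID-19", "India"]
    }
}

# story_records.json: keyed by Publishing House + $ + Publishing Date + $ + Story Title,
# storing the emails of every user who read that story
story_records = {
    "News18$2020-08-30$Some story title": ["XYZ@gmail.com", "ABC@gmail.com"]
}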
Now, we can offer the user three types of news distribution:
- Latest Feed: the fresh feed for every day
- Most Popular stories
- Customized Feed: may contain unvisited stories from the last 2–3 days, tuned according to the user's interests

One thing worth noticing is that the Latest feed is neither tuned nor sorted by popularity; still, it is essential, in order to make sure all the stories reach a user and to keep a bit of randomness, otherwise the whole thing would be too biased. The latest feed consists of the current date's stories only. To obtain a story's popularity, we use the JSON file that records the emails of all the users who visited each story: the popularity of a story is simply the length of its list of recorded emails, as sketched below.
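A minimal sketch of that popularity computation, assuming the story_records.json file described above (the helper name itself is hypothetical):

import json

def story_popularity(story_key, records_path="story_records.json"):
    # Popularity = number of user emails recorded against the story's key
    with open(records_path) as f:
        stories = json.load(f)
    return len(stories.get(story_key, []))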
The next thing we must do is add a search option. As readers, we often want to read about a particular topic, and this option lets our users do exactly that.
Lastly, we need to provide a “similar stories” option. When we buy a product on an e-commerce site, it shows us similar products to ease browsing. We will use a similar feature: if a user selects a particular story, we will show him/her similar stories to improve the experience.
Now that we have seen the whole idea, let's jump into the application part.
Let's first see what the news websites look like and how we can easily scrape the required data.
The above image shows the story headlines in red and the corresponding links in the HTML source in green. We need to extract the news story links from the code, follow them to the stories, and extract those as well.
from bs4 import BeautifulSoup
import requests

def News_18_scraper():
    URL = "https://www.news18.com/"
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html5lib')
    heads = {}
    # Lead story block
    sub = soup.find('div', attrs={'class': 'lead-story'})
    rows = sub.findAll('p')
    for row in rows:
        head = row.text
        heads[head] = {}
        heads[head]['Source'] = 'News18'
        heads[head]['link'] = row.a["href"]
    # Main story list
    sub = soup.find('ul', attrs={'class': 'lead-mstory'})
    rows = sub.findAll('li')
    for row in rows:
        head = row.text
        heads[head] = {}
        heads[head]["Source"] = 'News18'
        heads[head]["link"] = row.a["href"]
    return heads

The above piece of code extracts the links of the news stories for this particular media house.
The above image shows what a story webpage looks like. The title of the story is marked in green, the story text in red, and the story as it appears in the page source in blue. We need to scrape all of this required data.
from datetime import datetime

def extractor_n18(news):
    for n in news.keys():
        link = news[n]['link']
        r = requests.get(link)
        soup = BeautifulSoup(r.content, 'html5lib')
        sub = soup.find("title")
        news[n]['Titles'] = [sub.text]
        tit = sub.text
        flag = 0
        try:
            # First page layout: lead paragraph + article body
            flag = 1
            text = ""
            sub = soup.find('div', {'class': 'lbcontent paragraph'})
            text += sub.text + "\n"
            sub_2 = soup.find('div', {'id': 'article_body'})
            text += sub_2.text
            summary = summarizer(text)
        except:
            flag = 0
        if flag == 0:
            # Alternate page layout; fall back to the title if scraping fails
            text = ""
            try:
                sub = soup.find('article', {'class': 'article-content-box first_big_character'})
                rows = sub.findAll('p')
                for row in rows:
                    text += row.text + "\n"
                summary = summarizer(text)
            except:
                summary = tit
        news[n]['gists'] = summary
        news[n]['Date'] = datetime.today().strftime('%Y-%m-%d')
        news[n]['Time'] = str(datetime.now().time())
    return news

The above code can be used to extract the stories for this news agency.
I have created my own text summarizer using the PageRank algorithm.
import numpy as np

def pagerank(text, eps=0.000001, d=0.85):
    # `text` is the sentence-similarity matrix; the scores are refined iteratively until they converge
    score_mat = np.ones(len(text)) / len(text)
    delta = 1
    while delta > eps:
        score_mat_new = np.ones(len(text)) * (1 - d) / len(text) + d * text.T.dot(score_mat)
        delta = abs(score_mat_new - score_mat).sum()
        score_mat = score_mat_new
    return score_mat_new

The above code shows the PageRank algorithm. I will provide the link to the full code at the end.
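For completeness, here is a minimal sketch of how a summarizer() wrapper could use this pagerank() function: split a story into sentences, score them via a sentence-similarity matrix, and keep the top few as the gist. It reuses the get_similarity() helper shown later in this article; the actual implementation in the repository may differ.

import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

def build_similarity_matrix(sentences, stop_words):
    # Pairwise cosine similarity between count-vectorized sentences
    n = len(sentences)
    mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                mat[i][j] = get_similarity(word_tokenize(sentences[i]),
                                           word_tokenize(sentences[j]), stop_words)
    return mat

def summarizer(text, top_n=4):
    sentences = sent_tokenize(text)
    if len(sentences) <= top_n:
        return text
    stop_words = stopwords.words('english')
    scores = pagerank(build_similarity_matrix(sentences, stop_words))
    top = sorted(np.argsort(scores)[::-1][:top_n])   # keep the original sentence order
    return " ".join(sentences[i] for i in top)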
Now, we have four such news sources. We must scrape each of them individually and then compile everything into a database.
import pandas as pd

def Merge(dict1, dict2, dict3, dict4):
    res = {**dict1, **dict2, **dict3, **dict4}
    return res

def file_creater(date):
    news_times = times_now_scraper()
    times_now = extract_news_times(news_times)
    news_rep = republic_tv_scraper()
    republic_tv = extract_news_rep(news_rep)
    news_it = india_today_scraper()
    india_today = extractor_it(news_it)
    n_18 = News_18_scraper()
    News_18 = extractor_n18(n_18)
    Merged = Merge(times_now, republic_tv, india_today, News_18)
    Merged_df = pd.DataFrame(Merged)
    Merged_df_final = Merged_df.transpose()
    df_final = Merged_df_final.reset_index()
    df_final_2 = df_final.drop(['index'], axis=1)
    df_final_2.to_csv('feeds/Feed_' + date + '.csv', index=False)
    get_names('feeds/Feed_' + date + '.csv')
    return df_final_2

The above code gathers all the news together and forms a data CSV file for the date passed.
The get_names() function extracts the names or topics from the story titles, using the Named Entity Recognition feature of the spaCy library.
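As a rough, hypothetical sketch of that step (the real get_names() works over the feed CSV rather than a single title, and the model name here is only an assumption):

import spacy

nlp = spacy.load("en_core_web_sm")   # assumed model; any spaCy model with NER works

def title_labels(title):
    # Named entities in the headline (people, places, organisations, topics)
    # become the labels/tags for the story
    doc = nlp(title)
    return list({ent.text for ent in doc.ents})

# e.g. title_labels("PM addresses the nation on the COVID-19 situation in India")
# might return something like ['India', 'COVID-19'], depending on the model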
After the full processing, we obtain a CSV feed file for each date.
The above images describe how our news file databases look.
Next, we move to the user-control part. It starts with the login and signup pages.
import pandas as pd

def signup():
    Name = input("Name:")
    Email = input("Email:")
    Phone = input("Phone:")
    Password = input("Password:")
    Con_password = input("Confirm Password:")
    if Con_password != Password:
        print("Passwords don't match. Please retry")
        signup()
    df = pd.read_csv('user_data.csv')
    df_2 = df[df['email'] == Email]
    if len(df_2) != 0:
        print("Email already exists, try a different email")
        signup()
    wr = open('user_data.csv', 'a')
    wr.write(Name + "," + Email + "," + Password + "," + Phone + "\n")
    wr.close()
    print("Now please log in")

def login():
    print("1 to Signup, 2 to Login")
    ch = int(input())
    if ch == 1:
        signup()
    df = pd.read_csv('user_data.csv')
    Email = input("Email:")
    Pass = input("Pass:")
    df_2 = df[df['email'] == Email]
    if len(df_2) == 0:
        print("Email not found, try again")
        login()
    if str(df_2.iloc[0]['password']) == Pass:
        print("Welcome " + df_2.iloc[0]['Name'])   # greet the matched user
        surf(Email)
    else:
        print("Password wrong, try again")
        login()

The above snippet handles login and signup.
The above image demonstrates the signup portion. It has certain checks; for example, if the email already exists, it asks the user to sign up with a different email.
The above image shows the structure of the users’ database. Now, let’s take a look at the two JSON file structures.
The first file, user_records.json, is shown in the above image. As discussed, it records the news stories, and the corresponding labels, visited by the user with the email XYZ@gmail.com.
The image shows our second file, story_records.json. As seen earlier, it creates a key for each story and logs the emails of the users who visited it. The length of a story's visitor list gives us the popularity of the story.
Now, let's return to the working of the application.
It shows the application at work. As soon as we log in, it creates a session with the email id and keeps logging actions against that email id. It provides us with the latest feed and then offers the following options:
- Searching
- Reading from the feed provided
- Popular stories
- Customized stories

If we want to read from the feed, it asks us to enter the index of a story. It then launches the chosen story and also gives us a list of similar stories to choose from; a rough sketch of this session loop follows below.
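The login code above hands control to surf(), which is not shown in the article. Purely as an illustration of the flow just described, a hypothetical version might look like this (show_feed, read_from_feed, popular_stories, and customized_stories are placeholder helpers, not functions from the original code):

from datetime import datetime
import pandas as pd

def surf(email):
    date = datetime.today().strftime('%Y-%m-%d')
    feed = pd.read_csv('feeds/Feed_' + date + '.csv')   # latest feed: current date only
    show_feed(feed)                                      # print headlines with their indices
    while True:
        print("1: Search  2: Read from feed  3: Popular stories  4: Customized stories  5: Exit")
        ch = int(input())
        if ch == 1:
            search(email, feed)
        elif ch == 2:
            read_from_feed(email, feed)      # asks for an index, opens the story, logs the visit
        elif ch == 3:
            popular_stories(email)
        elif ch == 4:
            customized_stories(email)
        else:
            break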
For similar stories, I simply rank the story titles by their cosine similarity to the chosen story's title, after a bit of preprocessing. One thing to keep in mind is that we only use the feeds from the last 3 consecutive days, i.e., if the user opens the app on the 4th, our feed will contain data from the 2nd to the 4th. This prevents the application from showing very old feeds and also reduces computation; one possible way to assemble this window is sketched below.
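A simple way to assemble that three-day window, assuming the datewise file naming used in file_creater() above (the helper itself is hypothetical):

from datetime import datetime, timedelta
import pandas as pd

def load_recent_feeds(days=3):
    # Concatenate the feed files for the last `days` dates; missing files are skipped
    frames = []
    for i in range(days):
        d = (datetime.today() - timedelta(days=i)).strftime('%Y-%m-%d')
        try:
            frames.append(pd.read_csv('feeds/Feed_' + d + '.csv'))
        except FileNotFoundError:
            pass
    return pd.concat(frames, ignore_index=True)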
import re

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance

nltk.download('punkt')
nltk.download('stopwords')

def clean_sentence(sentence):
    # Remove special characters and tokenize the sentence into words
    cleaned = re.sub("[^a-zA-Z0-9]", " ", sentence)
    return word_tokenize(cleaned)

def get_similarity(sent_1, sent_2, stop_words):
    sent_1 = [w.lower() for w in sent_1]
    sent_2 = [w.lower() for w in sent_2]
    total = list(set(sent_1 + sent_2))   # vocabulary of both sentences, duplicates removed
    vec_1 = [0] * len(total)
    vec_2 = [0] * len(total)
    # Count vectorization of the two sentences, ignoring stop words
    for w in sent_1:
        if w not in stop_words:
            vec_1[total.index(w)] += 1
    for w in sent_2:
        if w not in stop_words:
            vec_2[total.index(w)] += 1
    return 1 - cosine_distance(vec_1, vec_2)

The above code is used to preprocess the data and obtain the cosine similarity: it removes special characters, converts everything to lowercase, and discards the stop words.
Next, we move to the searching portion. The user enters a topic; we take the topic, obtain its cosine similarity with each story title individually, and sort the results in non-increasing order to get the search results. We could have used the labels instead, but since they are extracted automatically rather than manually, relying on them might degrade the results.
import ast
import json
import time
import webbrowser

def search(email, df):
    clear()                      # console-clearing helper defined elsewhere
    query = input("search")
    df_temp = df
    stop_words = stopwords.words('english')
    sim = []
    for i in range(len(df)):
        try:
            title = ast.literal_eval(df.iloc[i]['Titles'])[0]
            cleaned_1 = clean_sentence(query)
            cleaned_2 = clean_sentence(title)
            s = get_similarity(cleaned_1, cleaned_2, stop_words)
            if s < 0.95:
                sim.append(s)
            else:
                sim.append(0)
        except:
            sim.append(0)
    df_temp['Sim'] = sim
    df_temp.sort_values(by=['Sim'], inplace=True, ascending=False)

    print("\n\n Top 5 Results \n")
    for i in range(5):
        res = ast.literal_eval(df_temp.iloc[i]['Titles'])
        print(str(i + 1) + "-> " + res[0])
        print(df_temp.iloc[i]['Source'] + " , " + df_temp.iloc[i]['Date'])
        print('\n\n')

    ind = int(input("Please provide the index of the story"))
    webbrowser.open(df_temp.iloc[ind - 1]['link'])
    time.sleep(3)

    # Log the visited story and its labels against the user's email
    try:
        with open('user_records.json') as file_u:
            users = json.load(file_u)
    except FileNotFoundError:
        users = {}   # start fresh if the file does not exist yet
    key = df_temp.iloc[ind - 1]['Source'] + df_temp.iloc[ind - 1]['Date'] + \
        ast.literal_eval(df_temp.iloc[ind - 1]['Titles'])[0]
    lab = [z for z in ast.literal_eval(df_temp.iloc[ind - 1]['labels'])]
    if email not in users.keys():
        users[email] = {}
        users[email]['news'] = [key]
        users[email]['labels'] = lab
    else:
        users[email]['news'].append(key)
        for l in lab:
            users[email]['labels'].append(l)
    with open("user_records.json", "w") as outfile:
        json.dump(users, outfile)

    # Log the user's email against the story
    with open('story_records.json') as file_s:
        stories = json.load(file_s)
    if key not in stories.keys():
        stories[key] = [email]
    else:
        stories[key].append(email)
    with open("story_records.json", "w") as outfile:
        json.dump(stories, outfile)

The above code is used for searching. The function takes in the email and the news database in which we have to search.
The above image shows the search feature. If we search for COVID-19, the application gives us the top 5 matches for COVID-19, along with the media house and the publication date.
Content-Based Filtering
We are not going to use content-based filtering in its full form; we are just going to borrow the idea behind the approach. We pick up the labels of the stories the user has visited from the JSON files, considering only the last 20 labels, because if we consider more, the recommendations will not shift as the user's interests shift. Next, we compare the overlap between the labels viewed by the user and the labels of each story individually, and recommend the top 10 overlaps. One thing to note is that we will not show stories the user has already seen; we explicitly set their overlap to 0 to prevent this from happening. We can get that information from the story_records.json file, as sketched below.
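A minimal sketch of this idea, assuming the column names used in the feed files above ('Titles', 'Source', 'Date', 'labels') and the key format used in the search code; the function name is hypothetical:

import ast
import json

def recommend_content_based(email, df, top_n=10, recent_labels=20):
    with open('user_records.json') as f:
        users = json.load(f)
    with open('story_records.json') as f:
        stories = json.load(f)
    # Only the last 20 labels the user visited, so recommendations follow shifting interests
    interest = users.get(email, {}).get('labels', [])[-recent_labels:]

    overlaps = []
    for i in range(len(df)):
        title = ast.literal_eval(df.iloc[i]['Titles'])[0]
        key = df.iloc[i]['Source'] + df.iloc[i]['Date'] + title
        if email in stories.get(key, []):
            overlaps.append(0)   # already read: force the overlap to zero
        else:
            labels = ast.literal_eval(df.iloc[i]['labels'])
            overlaps.append(len(set(labels) & set(interest)))

    ranked = df.copy()
    ranked['overlap'] = overlaps
    return ranked.sort_values(by='overlap', ascending=False).head(top_n)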
To check the similarity of the story labels, we again use cosine similarity. Note that, in every case, we use the same count-vectorization method to obtain the similarity.
The image shows the customized stories feature in action.
This more or less sums up the application and its description.
The above video provides a short demo of the application.
We have seen how we can develop an online news distribution application.
The GitHub link is here.
Hope this helps.
Originally published at: https://medium.com/the-innovation/an-approach-to-build-an-online-news-distribution-system-acda2aa8059b