Anakeyn TF-IDF Keywords Suggest is a keyword suggestion tool for SEO and web marketing.
The tool retrieves the first x web pages that Google returns for a given query.
The system then fetches the content of those pages to find popular or original keywords related to the searched topic. It relies on a TF-IDF algorithm.
TF-IDF stands for term frequency–inverse document frequency. It is a statistical measure of how important a word or phrase is in a document relative to a corpus or collection of documents.
In our case a document is a web page, and the corpus is the set of web pages retrieved.
Because we need "general" indicators for the expressions found, we compute the mean of the TF-IDF values of each expression to find the most popular ones, and the mean of its non-zero TF-IDF values to determine the most original ones.
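To make the two indicators concrete, here is a minimal pure-Python sketch. It is only an illustration: the toy corpus, the helper names (tf_idf_matrix, mean_scores, nonzero_mean_scores) and the textbook weighting tf × log(N/df) are assumptions made for this example. The application itself relies on scikit-learn's TfidfVectorizer, whose weighting differs in its details.

```python
import math
from collections import Counter

# Toy corpus: each string stands for the text of one web page (hypothetical data).
docs = [
    "seo audit seo tools",
    "seo tools for marketing",
    "rare keyword research",
]

def tf_idf_matrix(documents):
    """Return (terms, rows) where rows[d][t] is the TF-IDF of term t in document d."""
    tokenized = [doc.split() for doc in documents]
    terms = sorted({word for words in tokenized for word in words})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(word for words in tokenized for word in set(words))
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([
            (counts[t] / len(words)) * math.log(n_docs / df[t])
            for t in terms
        ])
    return terms, rows

def mean_scores(terms, rows):
    """Mean TF-IDF over all documents; zeros pull down terms absent from many pages."""
    return {t: sum(r[i] for r in rows) / len(rows) for i, t in enumerate(terms)}

def nonzero_mean_scores(terms, rows):
    """Mean TF-IDF over the documents where the value is non-zero only."""
    scores = {}
    for i, t in enumerate(terms):
        values = [r[i] for r in rows if r[i] != 0]
        scores[t] = sum(values) / len(values) if values else 0.0
    return scores

terms, rows = tf_idf_matrix(docs)
popular = mean_scores(terms, rows)           # "popular" indicator
original = nonzero_mean_scores(terms, rows)  # "original" indicator
```

In this toy corpus the word "rare" appears on a single page: its plain mean is diluted by the two pages where it scores zero, while its non-zero mean keeps the full value, which is why the second indicator brings out expressions that are strong on only a few pages.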
The application is developed in Python (in our case Anaconda with Python 3.7) as a web application.
For the web side we use Flask:
Flask is a web framework for Python. It provides the features needed to build web applications, including HTTP request handling and presentation templates. Flask was originally created by Armin Ronacher.
Flask works notably with the Jinja template engine, created by the same author, which we use in this project.
For database management we use SQLAlchemy. SQLAlchemy is an ORM (Object-Relational Mapper) that stores the data of Python objects in a database-style representation without having to deal with the details of the underlying database.
SQLAlchemy is compatible with PostgreSQL, MySQL, Oracle, Microsoft SQL Server… For now we use it with SQLite, which ships with Python by default.
Application structure
The application is structured as follows:
database.db: SQLite database, created automatically on first use.
favicon.ico: Anakeyn favicon.
tfidfkeywordssuggest.py: main program (the one to run).
license.txt: text of the GPL 3 license.
myconfig.py: program containing the initial configuration parameters, notably the database type and name as well as the 2 default users: admin (password "adminpwd") and guest (password "guestpwd").
requirements.txt: list of the Python libraries to install.
__init__.py: marks this folder as an importable package.
configdata directory: contains 2 configuration files:
tldlang.xlsx: parameters for the top-level domains of the various Google sites (Top Level Domain, e.g. .com, .fr, .co.uk …) and the desired result languages. There are 358 top-level domain / language combinations, which makes it possible to search for keywords in many languages and many countries.
user_agents-taglang.txt: a list of valid user agents (for example a web browser). These agents are used at random to avoid being recognized by Google too quickly and getting blocked. A "{{taglang}}" tag (in Jinja format) lets the program set the target language in the user agent.
The static directory contains the images and the .css files (Cascading Style Sheets) of the web application.
The templates directory contains the HTML pages in Jinja format.
The uploads directory is used to store the keyword files found. A subdirectory is created for each user.
For each query, the system creates 7 files of "popular" keywords/expressions: 1 with mixed sizes from 1 to 6 words, and 1 for each size in number of words: 1, 2, 3, 4, 5 or 6 words. In the same way, it creates 7 files for the "original" keywords. If the document corpus is large enough, the system offers up to 10,000 keywords per file!
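The count of 14 files follows from 2 kinds of keywords times 7 size settings. The short sketch below only enumerates those combinations. The names KINDS, SIZES and result_sets are hypothetical and do not come from the application, and no actual file name is implied.

```python
from itertools import product

KINDS = ("popular", "original")
# None stands for the mixed file (expressions of 1 to 6 words); 1..6 are fixed sizes.
SIZES = (None, 1, 2, 3, 4, 5, 6)

def result_sets(kinds=KINDS, sizes=SIZES):
    """Return one (kind, size) pair per result file."""
    return list(product(kinds, sizes))

sets = result_sets()
print(len(sets))  # number of result files per query
```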
Anaconda installs several tools on your computer:
Anaconda tools
Open Anaconda Prompt and go to the directory where you previously installed the application. For example on Windows: ">cd c:\Users\myname\documents\…\…"
Check that the "requirements.txt" file is in your directory: dir (Windows), ls (Linux). This file lists the libraries (or dependencies) that must be installed beforehand for our application to work.
To install them on Linux, type the following command:
while read requirement; do conda install --yes $requirement || pip install $requirement; done < requirements.txt
To install them on Windows, type the following command:
FOR /F "delims=~" %f in (requirements.txt) DO conda install --yes "%f" || pip install "%f"
Anaconda Prompt
Warning: installing the dependencies can take quite a while, so be patient! This only needs to be done once.
Next, launch Spyder and open the main Python file tfidfkeywordssuggest.py
Spyder keywordssuggest.py
Check that you are in the right directory (the one containing your Python file) and click the green arrow to run the Python file.
Then open a browser and go to http://127.0.0.1:5000. This is the application's default address on your computer.
Anakeyn Keywords Suggest home page
Click "TF-IDF Keywords Suggest": the system is protected by a username and password. The defaults are admin / adminpwd for the administrator and guest / guestpwd for the guest.
Then choose a keyword or expression and the target country/language pair:
Search in Anakeyn Keywords Suggest
The system searches Google for the first x pages (x is set in the configuration file) matching the searched keyword, saves them, retrieves their contents and computes a TF-IDF value for each term found in the pages. It then provides 14 result files with up to 10,000 popular or original expressions.
Anakeyn Keywords Suggest results
As you can see, not all languages are filtered by Google. See the list for the "lr" parameter on this page: https://developers.google.com/custom-search/docs/xml_results_appendices#lrsp . However, with the country filter and the language indicated in the user agent, the results are often satisfactory.
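To show how the language and country filters fit together, here is a small sketch that builds a Google search URL by hand. It is an assumption-laden illustration: build_search_url is a hypothetical helper, and the application delegates the actual querying to the googlesearch library. The parameter names hl, lr and cr and the sample values (lang_fr, countryFR) are the ones that appear as columns and defaults in the configuration.

```python
from urllib.parse import urlencode

def build_search_url(keyword, tld="com", hl="en", lr="lang_en", cr="countryUS", num=10):
    """Build a Google search URL from a TLD and language/country filters."""
    params = {"q": keyword, "hl": hl, "num": num}
    # lr (language restrict) and cr (country restrict) are optional: skip them when empty.
    if lr:
        params["lr"] = lr
    if cr:
        params["cr"] = cr
    return "https://www.google." + tld + "/search?" + urlencode(params)

url = build_search_url("SEO", tld="fr", hl="fr", lr="lang_fr", cr="countryFR")
print(url)
```

When a language has no "lr" value, passing an empty string leaves that filter out, and only the country filter and the user agent language remain to steer the results.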
For example, below is the beginning of the results file for the search "SEO", containing the original 2-word expressions targeted at Swahili for the Democratic Republic of the Congo:
Swahili results, Democratic Republic of the Congo, for "SEO"
Source code
Here we present the Python source code of myconfig.py and tfidfkeywordssuggest.py, as well as the tfidfkeywordssuggest.html template.
# -*- coding: utf-8 -*-
"""
Created on Tue Jul 16 15:41:05 2019
@author: Pierre
"""
import pandas as pd #for dataframes
#Define your database
myDatabaseURI = "sqlite:///database.db"
#define the default upload/download parent directory
UPLOAD_SUBDIRECTORY = "/uploads"
#define an admin and a guest
#change if needed.
myAdminLogin = "admin"
myAdminPwd = "adminpwd"
myAdminEmail = "admin@example.com"
myGuestLogin = "guest"
myGuestPwd = "guestpwd"
myGuestEmail = "guest@example.com"
#define Google TLD and languages and stopWords
#see https://developers.google.com/custom-search/docs/xml_results_appendices
#https://www.metamodpro.com/browser-language-codes
#Languages
dfTLDLanguages = pd.read_excel('configdata/tldLang.xlsx')
dfTLDLanguages.fillna('',inplace=True)
if len(dfTLDLanguages) == 0 :
    #if 1==1 :
    data = [['google.com', 'United States', 'com', 'en', 'lang_en', 'countryUS', 'en-us', 'english', 'United States', 'en - English'],
            ['google.co.uk', 'United Kingdom', 'co.uk', 'en', 'lang_en', 'countryUK', 'en-uk', 'english', 'United Kingdom', 'en - English'],
            ['google.de', 'Germany', 'de','de', 'lang_de', 'countryDE', 'de-de', 'german', 'Germany', 'de - German'],
            ['google.fr', 'France', 'fr', 'fr', 'lang_fr', 'countryFR', 'fr-fr', 'french', 'France', 'fr - French']]
    dfTLDLanguages = pd.DataFrame(data, columns = ['tldLang', 'description', 'tld', 'hl', 'lr', 'cr', 'userAgentLanguage', 'stopWords', 'countryName', 'ISOLanguage' ])
myTLDLang = [tuple(r) for r in dfTLDLanguages[['tldLang', 'description']].values]
#print(myTLDLang)
dfTLDLanguages = dfTLDLanguages.drop_duplicates()
dfTLDLanguages.set_index('tldLang', inplace=True)
dfTLDLanguages.info()
#dfTLDLanguages['userAgentLanguage']
#USER AGENTS
with open('configdata/user_agents-taglang.txt') as f:
    userAgentsList = f.read().splitlines()
#pauses to scrap Google
myLowPause=2
myHighPause=6
#define max pages to scrap on Google (30 is enough 100 max)
myMaxPagesToScrap=30
#define refresh delay (usually 30 days)
myRefreshDelay = 30
myMaxFeatures = 10000 #to calculate tf-idf
#Name of account type
myAccountTypeName=['Admin', 'Gold', 'Silver','Bronze', 'Guest',]
#max number of results to keep in TF-IDF depending on role
myMaxResults=[10000, 10000, 5000, 1000, 100]
#max searches by day depending on role
myMaxSearchesByDay=[10000, 10000, 1000, 100, 10]
#min ngram
myMinNGram=1 #Not a good idea to change this for the moment
#max ngram
myMaxNGram=6 ##Not a good idea to change this for the moment
#CSV separator
myCsvSep = ","
Important parameters:
myDatabaseURI: you can change this variable if you want a MySQL database (mysql://scott:tiger@localhost/mydatabase) or a Postgres one (postgresql://scott:tiger@localhost/mydatabase).
myAdminLogin, myAdminPwd …: change these if you want different usernames/passwords for the users. (User management is planned for the future.)
myRefreshDelay: delay between 2 reads on Google. We chose 30 days because alternative sources of ranking data, such as Yooda Insight or SEMrush, are often updated monthly. Note: since we plan to use their APIs in the future, this gives a consistent delay across the different sources. It also makes the tool faster and avoids querying Google needlessly.
myCsvSep: by default we use a "," as the separator for the result files. If you have a French version of Excel, you can choose ";" to be able to open these files.
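The effect of the separator can be seen with a short sketch based on the standard csv module. The helper write_results and the sample rows are hypothetical, and how the application actually writes its files is not shown in this excerpt. The column names feature and value are the ones used by the TF-IDF helper functions of the main program.

```python
import csv
import io

def write_results(rows, sep=","):
    """Serialise (expression, score) rows to CSV text using the given separator."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=sep, lineterminator="\n")
    writer.writerow(["feature", "value"])
    writer.writerows(rows)
    return buffer.getvalue()

rows = [("seo audit", 0.42), ("keyword, long tail", 0.17)]
comma_csv = write_results(rows, sep=",")      # default separator
semicolon_csv = write_results(rows, sep=";")  # suits a French-locale Excel
print(semicolon_csv)
```

With a comma separator, an expression that itself contains a comma has to be quoted; with a semicolon it is written as-is.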
TFIDFkeywordssuggest.py: main program
Here we split the source code into several pieces to present it.
# -*- coding: utf-8 -*-
"""
Created on Tue Jul 16 15:41:05 2019
@author: Pierre
"""
#############################################################
# Anakeyn Keywords Suggest version Alpha 0.0
# Anakeyn Keywords Suggest is a keywords suggestion tool.
# This tool searches in the first pages responding to a given keyword in Google. Next the
# system will get the content of the pages in order to find popular and original keywords
# in the subject area. The system works with a TF-IDF algorithm.
#############################################################
#Copyright 2019 Pierre Rouarch
# License GPL V3
#############################################################
#see also
#https://github.com/PrettyPrinted/building_user_login_system
#https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-i-hello-world
#https://github.com/MarioVilas/googlesearch #googlesearch serp scraper
############### FOR FLASK ###############################
#conda install -c anaconda flask
from flask import Flask, render_template, redirect, url_for, Response, send_file
#from flask import session #for the sessions variables
#from flask_session import Session #for sessions
#pip install flask-bootstrap #if not installed in a console
from flask_bootstrap import Bootstrap #to have a responsive design with flask
from flask_wtf import FlaskForm #forms
from wtforms import StringField, PasswordField, BooleanField, SelectField #field types
from wtforms.validators import InputRequired, Email, Length #field validators
from flask_sqlalchemy import SQLAlchemy
from werkzeug.security import generate_password_hash, check_password_hash
from flask_login import LoginManager, UserMixin, login_user, login_required, logout_user, current_user
import time
from datetime import datetime, date #, timedelta
############## For other Functionalities
import numpy as np #for vectors and arrays
import pandas as pd #for dataframes
#pip install google #to install Google Searchlibrary by Mario Vilas
#https://python-googlesearch.readthedocs.io/en/latest/
import googlesearch #Scrap serps
#to randomize pause
import random
#
import nltk # for text mining
from nltk.corpus import stopwords
nltk.download('stopwords') #for stopwords
#print (stopwords.fileids())
#TF-IDF function
from sklearn.feature_extraction.text import TfidfVectorizer
import requests #to read urls contents
from bs4 import BeautifulSoup
from bs4.element import Comment
import re #for regex
import unicodedata #to decode accents
import os #for directories
import sys #for sys variables
##### Flask Environment
# Returns the directory the current script (or interpreter) is running in
def get_script_directory():
    path = os.path.realpath(sys.argv[0])
    if os.path.isdir(path):
        return path
    else:
        return os.path.dirname(path)
myScriptDirectory = get_script_directory()
#############################################################
# In a myconfig.py file, think to define the database server name
#############################################################
import myconfig #my configuration : edit this if needed
#print(myconfig.myDatabaseURI )
myDirectory =myScriptDirectory+myconfig.UPLOAD_SUBDIRECTORY
if not os.path.exists(myDirectory ):
    os.makedirs(myDirectory )
app = Flask(__name__) #flask application
app.config['SECRET_KEY'] = 'Thisissupposedtobesecret!' #what you want
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False #avoid a warning
app.config['SQLALCHEMY_DATABASE_URI'] =myconfig.myDatabaseURI #database choice
bootstrap = Bootstrap(app) #for bootstrap compatibility
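The listing above hard-codes SECRET_KEY. As a hedged alternative, the sketch below reads the key from an environment variable and falls back to a random value. The function load_secret_key and the variable name FLASK_SECRET_KEY are assumptions made for this example and are not part of the application.

```python
import os
import secrets

def load_secret_key(env_var="FLASK_SECRET_KEY"):
    """Return the secret key from the environment, or a fresh random one."""
    key = os.environ.get(env_var)
    if key:
        return key
    # 32 random bytes rendered as 64 hexadecimal characters.
    return secrets.token_hex(32)

secret_key = load_secret_key()
# In the application this value would be assigned to app.config['SECRET_KEY'].
```

Note that a randomly generated key changes at every start, so existing login sessions stop being valid after a restart; setting the environment variable avoids that.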
Creation of the database, the tables and the required directories.
############# #########################
# Database and Tables
#######################################
db = SQLAlchemy(app)#the current database attached to the app.
#######################################################
#Save session data in a global DataFrame depending on user_id
global dfSession
dfSession = pd.DataFrame(columns=['user_id', 'userName', 'role', 'keyword', 'tldLang', 'keywordId', 'keywordUserId'])
dfSession.set_index('user_id', inplace=True)
dfSession.info()
#for tfidf counts
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'value']
    return df
def top_mean_feats(Xtr, features, grp_ids=None, top_n=25):
    ''' Return the top n features that on average are most important amongst documents in rows
    identified by indices in grp_ids. '''
    if grp_ids:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()
    #D[D < min_tfidf] = 0 #keep all values
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)
#Best for original Keywords
def top_nonzero_mean_feats(Xtr, features, grp_ids=None, top_n=25):
    ''' Return the top n features that on nonzero average are most important amongst documents in rows
    identified by indices in grp_ids. '''
    if grp_ids:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()
    #D[D < min_tfidf] = 0
    tfidf_nonzero_means = np.nanmean(np.where(D!=0,D,np.nan), axis=0) #replace 0 with NaN
    return top_tfidf_feats(tfidf_nonzero_means, features, top_n)
@login_manager.user_loader
def load_user(user_id):
    return User.query.get(int(user_id))
#Forms
class LoginForm(FlaskForm):
    username = StringField('username', validators=[InputRequired(), Length(min=4, max=15)])
    password = PasswordField('password', validators=[InputRequired(), Length(min=8, max=80)])
    remember = BooleanField('remember me')
#we don't use this for the moment
class RegisterForm(FlaskForm):
    email = StringField('email', validators=[InputRequired(), Email(message='Invalid email'), Length(max=50)])
    username = StringField('username', validators=[InputRequired(), Length(min=4, max=15)])
    password = PasswordField('password', validators=[InputRequired(), Length(min=8, max=80)])
    role = 4
#search keywords by keyword
class SearchForm(FlaskForm):
    myTLDLang = myconfig.myTLDLang
    keyword = StringField('keyword / Expression', validators=[InputRequired(), Length(max=200)])
    tldLang = SelectField('Country - Language', choices=myTLDLang, validators=[InputRequired()])
############### other functions
#####Get strings from tags
def getStringfromTag(tag="h1", soup="") :
    theTag = soup.find_all(tag)
    myTag = ""
    for x in theTag:
        myTag= myTag + " " + x.text.strip()
    return myTag.strip()
#remove comments and non visible tags
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
def strip_accents(text, encoding='utf-8'):
    """
    Strip accents from input String.
    :param text: The input string.
    :type text: String.
    :returns: The processed String.
    :rtype: String.
    """
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode(encoding)
    return str(text)
# Get a random user agent.
def getRandomUserAgent(userAgentsList, userAgentLanguage):
    theUserAgent = random.choice(userAgentsList)
    if len(userAgentLanguage) > 0 :
        theUserAgent = theUserAgent.replace("{{tagLang}}","; "+str(userAgentLanguage))
    else :
        theUserAgent = theUserAgent.replace("{{tagLang}}",""+str(userAgentLanguage))
    return theUserAgent
#ngrams in list
def words_to_ngrams(words, n, sep=" "):
    return [sep.join(words[i:i+n]) for i in range(len(words)-n+1)]
#one-gram tokenizer
tokenizer = nltk.RegexpTokenizer(r'\w+')
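As a quick illustration of how the accent stripping, the one-gram tokenizer and words_to_ngrams could be combined, here is a standalone sketch using only the standard library. The sample sentence and the order of the steps are assumptions made for this example, and re.findall(r"\w+") stands in for nltk.RegexpTokenizer(r'\w+').

```python
import re
import unicodedata

def strip_accents(text):
    """Remove accents by decomposing characters and dropping the non-ASCII marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def words_to_ngrams(words, n, sep=" "):
    """Return every run of n consecutive words joined by sep."""
    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "Référencement naturel et mots clés"
# Split the accent-free, lower-cased text into one-word tokens.
words = re.findall(r"\w+", strip_accents(text).lower())
bigrams = words_to_ngrams(words, 2)
trigrams = words_to_ngrams(words, 3)
print(bigrams)
```

A list of k words yields k - n + 1 expressions of n words, so a text shorter than n words produces no expression of that size.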
#################### WebSite ##################################
#Routes
@app.route('/')
def index():
    return render_template('index.html')
@app.route('/login', methods=['GET', 'POST'])
def login():
    form = LoginForm()
    if form.validate_on_submit():
        user = User.query.filter_by(username=form.username.data).first()
        if user:
            if check_password_hash(user.password, form.password.data):
                login_user(user, remember=form.remember.data)
                return redirect(url_for('keywordssuggest')) #go to the keywords Suggest page
            return '<h1>Invalid password</h1>'
        return '<h1>Invalid username</h1>'
        #return '<h1>' + form.username.data + ' ' + form.password.data + '</h1>'
    return render_template('login.html', form=form)
#Not used here.
@app.route('/signup', methods=['GET', 'POST'])
def signup():
    form = RegisterForm()
    if form.validate_on_submit():
        hashed_password = generate_password_hash(form.password.data, method='sha256')
        new_user = User(username=form.username.data, email=form.email.data, password=hashed_password)
        db.session.add(new_user)
        db.session.commit()
        return '<h1>New user has been created!</h1>'
        #return '<h1>' + form.username.data + ' ' + form.email.data + ' ' + form.password.data + '</h1>'
    return render_template('signup.html', form=form)
#Not used here.
@app.route('/dashboard')
@login_required
def dashboard():
    return render_template('dashboard.html', name=current_user.username)
@app.route('/logout')
@login_required
def logout():
    logout_user()
    return redirect(url_for('index'))
Here the signup page and the global dashboard page are not used for now.
Note: Flask uses the concept of a "route". In web development, a route is a URL or a set of URLs that leads to the execution of a given function.
In Flask, routes are declared with the app.route decorator.
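The idea of declaring routes with a decorator can be illustrated without Flask. The toy class below is not Flask's implementation, only a sketch of the concept: TinyApp, its route and dispatch methods and the two sample handlers are all hypothetical names.

```python
class TinyApp:
    """Toy stand-in for a web application object (not Flask itself)."""

    def __init__(self):
        self.routes = {}

    def route(self, path):
        """Return a decorator that registers the decorated function for path."""
        def decorator(func):
            self.routes[path] = func
            return func
        return decorator

    def dispatch(self, path):
        """Call the function registered for path, or report a 404."""
        handler = self.routes.get(path)
        if handler is None:
            return "404 Not Found"
        return handler()

toy_app = TinyApp()

@toy_app.route("/")
def toy_index():
    return "home page"

@toy_app.route("/logout")
def toy_logout():
    return "logged out"

print(toy_app.dispatch("/"))
```

The decorator leaves the function unchanged and simply records it in a table keyed by URL, which is the mapping a request is later looked up in.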
TFIDFkeywordsuggest route: entering the expression and the country/language
#Route TFIDFkeywordssuggest
@app.route('/tfidfkeywordssuggest',methods=['GET', 'POST'])
@login_required
def tfidfkeywordssuggest():
    print("tfidfkeywordssuggest")
    #print("g.userId="+str(g.userId))
    if current_user.is_authenticated: #always because @login_required
        print("UserId= "+str(current_user.get_id()))
        myUserId = current_user.get_id()
        print("UserName = "+current_user.username)
        dfSession.loc[ myUserId, 'userName'] = current_user.username #Save Username for userId
        #make sure we have a good Role
        if current_user.role is None or current_user.role > 4 or current_user.role <0 :
            dfSession.loc[ myUserId,'role'] = 4 #4 is for guest
        else :
            dfSession.loc[ myUserId,'role'] = current_user.role
    myRole = dfSession.loc[ myUserId,'role']
    #count searches in a day
    myLimitReached = False
    mySearchesCount = db.session.query(KeywordUser).filter_by(username=current_user.username, search_date=date.today()).count()
    print("mySearchesCount="+str(mySearchesCount))
    print(" myconfig.myMaxSearchesByDay[myRole]="+str(myconfig.myMaxSearchesByDay[myRole]))
    if (mySearchesCount >= myconfig.myMaxSearchesByDay[myRole]):
        myLimitReached=True
    #reset values
    dfSession.loc[myUserId,'keyword'] = ""
    dfSession.loc[myUserId,'tldLang'] =""
    form = SearchForm()
    if form.validate_on_submit():
        dfSession.loc[myUserId,'keyword'] = form.keyword.data #save in session variable
        dfSession.loc[myUserId,'tldLang'] = form.tldLang.data #save in session variable
    return render_template('tfidfkeywordssuggest.html', name=current_user.username, form=form,
                           keyword = form.keyword.data , tldLang = form.tldLang.data,
                           role =myRole, MaxResults=myconfig.myMaxResults[myRole], limitReached=myLimitReached)
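This route handles the input form. It also checks whether you are allowed to run a search: the role is normalized (4, the guest role, is used when the stored role is missing or out of range), and the number of searches already made today is compared with the daily limit for that role.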
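The permission check can be isolated into two small functions. This is only a sketch under stated assumptions: the quota values below are invented for the example (the real ones live in myconfig.myMaxSearchesByDay, which is not shown here), and normalize_role and limit_reached are hypothetical helper names.

```python
GUEST_ROLE = 4  # the route above uses 4 for guests

# Hypothetical quotas per role; the real values are in myconfig.myMaxSearchesByDay
MAX_SEARCHES_BY_DAY = {0: 1000, 1: 100, 2: 50, 3: 10, 4: 3}

def normalize_role(role):
    """Fall back to the guest role when the stored role is missing or out of range."""
    if role is None or role < 0 or role > GUEST_ROLE:
        return GUEST_ROLE
    return role

def limit_reached(searches_today, role, quotas=MAX_SEARCHES_BY_DAY):
    """True when the user has used up today's quota for their role."""
    return searches_today >= quotas[normalize_role(role)]
```

With these example quotas, a guest who has already run 3 searches today is blocked, while a user with role 3 can go up to 10.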
PROGRESS route
The "progress" route performs the searches themselves. It also talks to the client (the tfidfkeywordssuggest.html page) by sending it progress information through the generate function, a generator that yields a new message at each step.
Because this part is fairly long, we will go through it in several steps.
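The messages sent to the client have the form "data:" followed by a value and a blank line, which is the format of server-sent events. The sketch below reproduces only that mechanism with the standard library. The sse_message helper and the steps argument are assumptions made for the example; the real generate function takes its work from the session data.

```python
def sse_message(percent):
    """Format one progress value as 'data:<value>' followed by a blank line."""
    return "data:" + str(percent) + "\n\n"

def generate(steps):
    """Run each step in turn and yield a progress message after it."""
    total = len(steps)
    for done, step in enumerate(steps, start=1):
        step()  # do the actual work for this step
        yield sse_message(int(done * 100 / total))

# Example: four dummy steps that do nothing
messages = list(generate([lambda: None] * 4))
```

In a Flask application, a generator like this is typically handed to Response(generate(...), mimetype='text/event-stream') so that each yielded string is pushed to the browser as soon as it is produced. Whether the original route is wired exactly this way is not shown in this section.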
Initialization and retrieval of session data
##############################
#run
###############################
if ( len(myKeyword) > 0 and len(myTLDLang) >0) :
    myDate=date.today()
    print('myDate='+str(myDate))
    goSearch=False #do we need to run a new search in Google?
    #did anybody already run this search during the last x days?
    firstKWC = db.session.query(Keyword).filter_by(keyword=myKeyword, tldLang=myTLDLang).first()
    #lastSearchDate = Keyword.query.filter(keyword==myKeyword, tldLang==myTLDLang ).first().format('search_date')
    if firstKWC is None:
        goSearch=True #never searched before
    else:
        myKeywordId=firstKWC.id
        dfSession.loc[ myUserId,'keywordId']=myKeywordId #Save in the dfSession
        print("last Search Date="+ str(firstKWC.search_date))
        Delta = myDate - firstKWC.search_date
        print(" Delta in days="+str(Delta.days))
        if Delta.days > myconfig.myRefreshDelay : #30 by default
            goSearch=True #stored data is too old
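The decision taken above (run a new Google search only if nobody searched this expression recently) boils down to a date comparison. Here is a minimal sketch of that rule as a standalone function. The name needs_new_search is hypothetical, and the default of 30 days follows the comment in the code above rather than the actual value of myconfig.myRefreshDelay.

```python
from datetime import date

def needs_new_search(last_search_date, today, refresh_delay_days=30):
    """Return True when no stored search exists or the stored one is too old."""
    if last_search_date is None:
        return True
    return (today - last_search_date).days > refresh_delay_days
```

For example, with today set to date(2019, 6, 1), a search stored on date(2019, 5, 2) is exactly 30 days old and is still reused, while one stored on date(2019, 5, 1) triggers a new search.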
Scraping Google and saving to the positions table of the database
#excerpt from the body of generate(); the enclosing indentation is omitted
###############################################
# Search in Google and scrap Urls
###############################################
if ( len(myKeyword) > 0 and len(myTLDLang) >0 and goSearch) :
    #get default language for tld
    myTLD = myconfig.dfTLDLanguages.loc[myTLDLang, 'tld']
    #myTLD=myTLD.strip()
    print("myTLD="+myTLD+"!")
    myHl = str(myconfig.dfTLDLanguages.loc[myTLDLang, 'hl'])
    #myHl=myHl.strip()
    print("myHl="+myHl+"!")
    myLanguageResults = str(myconfig.dfTLDLanguages.loc[myTLDLang, 'lr'])
    #myLanguageResults=myLanguageResults.strip()
    print("myLanguageResults="+myLanguageResults+"!")
    myCountryResults = str(myconfig.dfTLDLanguages.loc[myTLDLang, 'cr'])
    #myCountryResults=myCountryResults.strip()
    print("myCountryResults="+myCountryResults+"!")
    myUserAgentLanguage = str(myconfig.dfTLDLanguages.loc[myTLDLang, 'userAgentLanguage'])
    #myUserAgentLanguage=myUserAgentLanguage.strip()
    print("myUserAgentLanguage="+myUserAgentLanguage+"!")
    ###############################
    # Google Scrap
    ###############################
    myNum=10
    myStart=0
    myStop=10 #get by ten
    myMaxStart=myconfig.myMaxPagesToScrap #only 3 for test 10 in production
    #myTbs= "qdr:m" #search only last month, not used
    #tbs=myTbs,
    #pause may be long to avoid blocking from Google
    myLowPause=myconfig.myLowPause
    myHighPause=myconfig.myHighPause
    nbTrials = 0
    #this may be long
    while myStart < myMaxStart:
        myShow= int(round(((myStart*50)/myMaxStart)+1)) #for the progress bar (1 to 50 %)
        yield "data:" + str(myShow) + "\n\n"
        print("PASSAGE NUMBER :"+str(myStart))
        print("Query:"+myKeyword)
        #change user-agent and pause to avoid blocking by Google
        myPause = random.randint(myLowPause,myHighPause) #long pause
        print("Pause:"+str(myPause))
        #change user_agent and provide local language in the User Agent
        myUserAgent = getRandomUserAgent(myconfig.userAgentsList, myUserAgentLanguage)
        #myUserAgent = googlesearch.get_random_user_agent()
        print("UserAgent:"+str(myUserAgent))
        df = pd.DataFrame(columns=['query', 'page', 'position', 'source']) #working dataframe
        myPause=myPause*(nbTrials+1) #increase the pause if the previous trial returned nothing
        print("Pause:"+str(myPause))
        try :
            urls = googlesearch.search(query=myKeyword, tld=myTLD, lang=myHl, safe='off',
                                       num=myNum, start=myStart, stop=myStop, domains=None, pause=myPause,
                                       country=myCountryResults, extra_params={'lr': myLanguageResults}, tpe='', user_agent=myUserAgent)
            df = pd.DataFrame(columns=['keyword', 'tldLang', 'page', 'position', 'source', 'search_date'])
            for url in urls :
                print("URL:"+url)
                df.loc[df.shape[0],'page'] = url
            df['keyword'] = myKeyword #fill with current keyword
            df['tldLang'] = myTLDLang #fill with current country / tld lang
            df['position'] = df.index.values + 1 + myStart #position = index +1 + myStart
            df['source'] = "Scrap" #fill with source origin, here scraping Google
            #other potential options : Semrush, Yooda Insight...
            df['search_date'] = myDate
            dfScrap = pd.concat([dfScrap, df], ignore_index=True) #concat scraps
            # time.sleep(myPause) #add another pause
            if (df.shape[0] > 0):
                nbTrials = 0
                myStart += 10
            else :
                nbTrials +=1
                if (nbTrials > 3) : #give up on this results page and move on
                    nbTrials = 0
                    myStart += 10
            #myStop += 10
        except Exception :
            exc_type, exc_value, exc_traceback = sys.exc_info()
            print("GOOGLE ERROR")
            print(exc_type.__name__)
            print(exc_value)
            print(exc_traceback)
            time.sleep(600) #add a big pause if you get an error.
    #/while myStart < myMaxStart:
    #dfScrap.info()
    dfScrapUnique=dfScrap.drop_duplicates() #remove duplicates
    #dfScrapUnique.info()
    #Save in csv and json if needed
    #dfScrapUnique.to_csv("dfScrapUnique.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
    #dfScrapUnique.to_json("dfScrapUnique.json")
    #Bulk save of the dataframe in the position table
    dfScrapUnique.to_sql('position', con=db.engine, if_exists='append', index=False)
    myShow=50
    yield "data:" + str(myShow) + "\n\n" #to show 50 %
#/end search in Google
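The loop above mixes three ideas: paging through the results ten at a time, lengthening the pause when a page comes back empty, and mapping the current offset onto the first half of the progress bar. The sketch below separates them from the Google-specific code. It is an illustration under assumptions, not the author's implementation: fetch_page is a hypothetical callable standing in for the call to googlesearch.search, and scrape_pages and progress_percent are invented names.

```python
import random

def progress_percent(start, max_start):
    """Map the current offset onto the first half of the progress bar."""
    return int(round((start * 50) / max_start + 1))

def scrape_pages(fetch_page, max_start, low_pause=1, high_pause=3, max_trials=3, step=10):
    """Collect (position, url) pairs page by page.

    fetch_page(start, pause) is a hypothetical callable returning the list of
    URLs for the results page that starts at offset `start`.
    """
    results = []
    start, trials = 0, 0
    while start < max_start:
        # the pause grows with the number of empty attempts on this page
        pause = random.randint(low_pause, high_pause) * (trials + 1)
        urls = fetch_page(start, pause)
        if urls:
            # position = index in the page + 1 + page offset
            results.extend((start + i + 1, url) for i, url in enumerate(urls))
            trials = 0
            start += step
        else:
            trials += 1
            if trials > max_trials:  # give up on this page and move on
                trials = 0
                start += step
    return results
```

Passing the fetch function as an argument makes it possible to test the paging and retry logic with a fake function, without sending a single request to Google.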
Fetching the content of the Web pages and saving it to the pages table.
#excerpt from the body of generate(); the enclosing indentation is omitted
###################################
#update keyword and keyworduser
###################################
#need to get firstKWC in keyword table before
firstKWC = db.session.query(Keyword).filter_by(keyword=myKeyword, tldLang=myTLDLang).first()
#Did we just process a new Google scrap and page scrap?
if goSearch :
    myDataDate = myDate #Today
else :
    if firstKWC is None :
        myDataDate = myDate #Today
    else :
        myDataDate = firstKWC.data_date #old data date
#did somebody already process this search before?
if firstKWC is None :
    #insert
    newKeyword = Keyword(keyword= myKeyword, tldLang=myTLDLang , data_date=myDataDate, search_date=myDate)
    db.session.add(newKeyword)
    db.session.commit()
    db.session.refresh(newKeyword) #reload the row to get its generated id
    myKeywordId = newKeyword.id
else :
    myKeywordId = firstKWC.id
    #update
    firstKWC.data_date=myDataDate
    firstKWC.search_date=myDate
    db.session.commit()
myShow=91
yield "data:" + str(myShow) + "\n\n" #to show 91%
#for KeywordUser
#Did this user already process this search?
print(" myKeywordId="+str(myKeywordId))
dfSession.loc[ myUserId,'keywordId']=myKeywordId
dbKeywordUser = db.session.query(KeywordUser).filter_by(keyword= myKeyword, tldLang=myTLDLang, username=myUserName).first()
if dbKeywordUser is None :
    print("insert index new Keyword for = "+ myUserName)
    newKeywordUser = KeywordUser(keywordId= myKeywordId, keyword= myKeyword,
                                 tldLang=myTLDLang , username= myUserName, data_date=myDataDate, search_date=myDate)
    db.session.add(newKeywordUser)
    db.session.commit()
    myKeywordUserId=newKeywordUser.id
else :
    myKeywordUserId=dbKeywordUser.id #for the name
    print("exists update only myDataDate" )
    #update values for the current user
    dbKeywordUser.data_date=myDataDate
    dbKeywordUser.search_date=myDate
    db.session.commit()
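The code above follows an "insert if missing, otherwise update" pattern, once for the keyword table and once for the per-user table. To show that pattern in isolation, here is a minimal sketch with the sqlite3 module from the standard library instead of SQLAlchemy. The table layout is reduced to the columns used above, upsert_keyword is a hypothetical helper name, and the value "com-en" for tldLang is invented for the example.

```python
import sqlite3
from datetime import date

def upsert_keyword(conn, keyword, tld_lang, data_date, search_date):
    """Insert the (keyword, tldLang) row if missing, otherwise update its dates; return its id."""
    row = conn.execute(
        "SELECT id FROM keyword WHERE keyword = ? AND tldLang = ?",
        (keyword, tld_lang),
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO keyword (keyword, tldLang, data_date, search_date) VALUES (?, ?, ?, ?)",
            (keyword, tld_lang, data_date.isoformat(), search_date.isoformat()),
        )
        keyword_id = cur.lastrowid  # id generated by SQLite for the new row
    else:
        keyword_id = row[0]
        conn.execute(
            "UPDATE keyword SET data_date = ?, search_date = ? WHERE id = ?",
            (data_date.isoformat(), search_date.isoformat(), keyword_id),
        )
    conn.commit()
    return keyword_id

# In-memory database with a simplified, assumed schema
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE keyword (id INTEGER PRIMARY KEY, keyword TEXT, tldLang TEXT, "
    "data_date TEXT, search_date TEXT)"
)
first_id = upsert_keyword(conn, "seo", "com-en", date(2019, 6, 1), date(2019, 6, 1))
same_id = upsert_keyword(conn, "seo", "com-en", date(2019, 6, 1), date(2019, 6, 15))
```

The second call finds the existing row, so it returns the same id and only refreshes the dates, which is what keeps the keyword table free of duplicates when several users search for the same expression.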
Creating the keyword files computed by TF-IDF (and end of /progress)
<tr><td><a href="/popAllCSV">Download most {{ MaxResults }} popular expressions among Urls crawled</a></td><td width="10%"></td><td><a href="/oriAllCSV">Download {{ MaxResults }} original expressions among Urls crawled</a></td></tr>
<tr><td><a href="/pop1CSV">Download most {{ MaxResults }} popular 1 word expressions among Urls crawled</a></td><td width="10%"></td><td><a href="/ori1CSV">Download {{ MaxResults }} original 1 word expressions among Urls crawled</a></td></tr>
<tr><td><a href="/pop2CSV">Download most {{ MaxResults }} popular 2 words expressions among Urls crawled</a></td><td width="10%"></td><td><a href="/ori2CSV">Download {{ MaxResults }} original 2 words expressions among Urls crawled</a></td></tr>
<tr><td><a href="/pop3CSV">Download most {{ MaxResults }} popular 3 words expressions among Urls crawled</a></td><td width="10%"></td><td><a href="/ori3CSV">Download {{ MaxResults }} original 3 words expressions among Urls crawled</a></td></tr>
<tr><td><a href="/pop4CSV">Download most {{ MaxResults }} popular 4 words expressions among Urls crawled</a></td><td width="10%"></td><td><a href="/ori4CSV">Download {{ MaxResults }} original 4 words expressions among Urls crawled</a></td></tr>
<tr><td><a href="/pop5CSV">Download most {{ MaxResults }} popular 5 words expressions among Urls crawled</a></td><td width="10%"></td><td><a href="/ori5CSV">Download {{ MaxResults }} original 5 words expressions among Urls crawled</a></td></tr>
<tr><td><a href="/pop6CSV">Download most {{ MaxResults }} popular 6 words expressions among Urls crawled</a></td><td width="10%"></td><td><a href="/ori6CSV">Download {{ MaxResults }} original 6 words expressions among Urls crawled</a></td></tr>