Anakeyn TF-IDF Keywords Suggest is a keyword suggestion tool for SEO and web marketing.
The tool retrieves the first x web pages that Google returns for a query.
It then fetches the content of those pages to find popular or original keywords related to the searched topic. The system relies on a TF-IDF algorithm.
TF-IDF stands for term frequency–inverse document frequency. It is a statistical measure of how important a word or expression is in a document relative to a corpus, that is, a collection of documents.
In our case a document is a web page and the corpus is the set of retrieved web pages.
We already covered the TF-IDF concept in a previous article on Machine Learning. The formula is as follows:
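$$ w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right) $$
Where:
| $w_{i,j}$ | : TF-IDF weight of term i in document j |
| $tf_{i,j}$ | : number of occurrences of i in j |
| $df_i$ | : number of documents containing i |
| $N$ | : total number of documents |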
Because we need "general" indicators for the expressions found, we compute the mean of the TF-IDF values of each expression over all pages to find the most popular ones, and the mean of the non-zero TF-IDF values of each expression to determine the most original ones.
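To make the two indicators concrete, here is a minimal sketch in plain Python. It uses the classic weight tf × log(N / df) with the definitions given above, on three made-up "pages". The function names and the data are assumptions chosen for illustration. The application itself relies on scikit-learn's TfidfVectorizer, whose defaults apply smoothing and normalization, so its numbers will differ.

```python
import math

# Three toy "pages" (hypothetical data), already split into words.
docs = [
    ["seo", "audit", "seo"],
    ["seo", "tools"],
    ["keyword", "research", "tools"],
]

def tf_idf(term, doc, corpus):
    """Weight of `term` in `doc`: tf * log(N / df)."""
    tf = doc.count(term)                       # occurrences of the term in the document
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    return tf * math.log(len(corpus) / df) if df else 0.0

def mean_scores(term, corpus):
    """Return (mean over all documents, mean over non-zero values only)."""
    values = [tf_idf(term, d, corpus) for d in corpus]
    nonzero = [v for v in values if v != 0]
    mean_all = sum(values) / len(values)
    mean_nonzero = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return mean_all, mean_nonzero

for term in ("seo", "tools", "audit"):
    print(term, mean_scores(term, docs))
```

In this toy corpus "seo" gets the highest plain mean because it appears in several pages (a "popular" term), while "audit", present in a single page, gets the highest non-zero mean (an "original" term).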
All the source code and data for this Windows-compatible version are available on our GitHub at https://github.com/Anakeyn/TFIDFKeywordsSuggest
Technology used
The application is written in Python (in our case Anaconda Python 3.7) as a web application.
For the web layer we use Flask:

Flask is a web framework for Python. It provides the building blocks for web applications, including HTTP request handling and presentation templates. Flask was originally created by Armin Ronacher.
Flask works in particular with the Jinja template engine, created by the same author, which we use in this project.
For database management we use SQLAlchemy. SQLAlchemy is an ORM (Object-Relational Mapper) that stores the data of Python objects in a database-style representation without having to worry about the details of the underlying database.
SQLAlchemy is compatible with PostgreSQL, MySQL, Oracle, Microsoft SQL Server… For now we use it with SQLite, which ships with Python by default.
Application structure
The application is structured as follows:
KeywordsSuggest
|   database.db
|   favicon.ico
|   tfidfkeywordssuggest.py
|   license.txt
|   myconfig.py
|   requirements.txt
|   __init__.py
|
+---configdata
|       tldLang.xlsx
|       user_agents-taglang.txt
|
+---static
|       Anakeyn_Rectangle.jpg
|       tfidfkeywordssuggest.css
|       Oeil_Anakeyn.jpg
|       signin.css
|       starter-template.css
|
+---templates
|       index.html
|       tfidfkeywordssuggest.html
|       login.html
|       signup.html
|
+---uploads
Files at the root:
- database.db: SQLite database, created automatically on first use.
- favicon.ico: Anakeyn favicon.
- tfidfkeywordssuggest.py: main program (the one to run).
- license.txt: text of the GPL 3 license.
- myconfig.py: program holding the initial configuration parameters, notably the database type and name as well as the 2 default users: admin (password "adminpwd") and guest (password "guestpwd").
- requirements.txt: list of the Python libraries to install.
- __init__.py: marks this folder as an importable package.
The configdata directory contains 2 configuration files:
- tldLang.xlsx: parameters for the top-level domains (TLD, for example .com, .fr, .co.uk…) of the various Google sites and the desired result languages. There are 358 top-level domain / language combinations, which makes it possible to search for keywords in many languages and many countries.
- user_agents-taglang.txt: a list of valid user agents (for example a web browser). These agents are picked at random to avoid being recognized too quickly by Google and getting blocked. A "{{taglang}}" tag (in Jinja format) lets the program set the target language in the user agent.
The static directory contains the images and the .css files (Cascading Style Sheets) of the web application.
The templates directory contains the HTML pages in Jinja format.
The uploads directory stores the keyword files that were found. A subdirectory is created for each user.
For each query, the system creates 7 files of "popular" keywords/expressions: 1 mixing all sizes from 1 to 6 words and 1 for each size in number of words: 1, 2, 3, 4, 5 or 6 words. In the same way, it creates 7 files for the "original" keywords. If the document corpus is large enough, the system offers up to 10,000 keywords per file.
Try the program on your computer
- Download the application's .zip file from our GitHub: https://github.com/Anakeyn/TFIDFKeywordsSuggest/archive/master.zip and unzip it into a directory of your choice.
- Download and install Anaconda Python: https://www.anaconda.com/distribution/#download-section
- Anaconda installs several tools on your computer:

- Open Anaconda Prompt and go to the directory where you unzipped the application. For example on Windows: ">cd c:\Users\myname\documents\…\…"
- Check that the "requirements.txt" file is in that directory: dir (Windows), ls (Linux). This file lists the libraries (or dependencies) that must be installed beforehand for the application to work.
- To install them on Linux, type the following command:
while read requirement; do conda install --yes $requirement || pip install $requirement; done < requirements.txt
- To install them on Windows, type the following command:
FOR /F "delims=~" %f in (requirements.txt) DO conda install --yes "%f" || pip install "%f"

Warning: installing the dependencies can take a while, so be patient. This only needs to be done once.
- Next, launch Spyder and open the main Python file tfidfkeywordssuggest.py

- Check that you are in the right directory (the one containing your Python file) and click the green arrow to run the file.
- Then open a browser and go to http://127.0.0.1:5000. This is the application's default address on your computer.

- Click "TF-IDF Keywords Suggest": the system is protected by a login and password. The defaults are admin / adminpwd for the administrator and guest / guestpwd for the guest.
- Then choose a keyword or expression and the targeted country/language pair:

- The system searches Google for the first x pages (x is set in the configuration file) matching the keyword, saves them, retrieves their content and computes a TF-IDF for each term found in the pages. It then provides 14 result files with up to 10,000 popular or original expressions.

- As you can see, Google does not filter on every language. See the list for the "lr" parameter on this page: https://developers.google.com/custom-search/docs/xml_results_appendices#lrsp . However, with the country filter and the language given in the user agent, the results are often satisfactory.
- For example, below is the beginning of the results file for the "SEO" search, containing the original 2-word expressions, targeted at Swahili for the Democratic Republic of the Congo:

Source code
Here we present the Python source files myconfig.py and tfidfkeywordssuggest.py as well as the tfidfkeywordssuggest.html template.
You can copy/paste the following source code piece by piece, or download the whole package for free from our shop, to be sure of having the latest version, at: https://www.anakeyn.com/boutique/produit/script-python-tf-idf-keywords-suggest/
myconfig.py: general application parameters
# -*- coding: utf-8 -*-
"""
Created on Tue Jul 16 15:41:05 2019
@author: Pierre
"""
import pandas as pd #for dataframes
#Define your database
myDatabaseURI = "sqlite:///database.db"
#define the default upload/download parent directory
UPLOAD_SUBDIRECTORY = "/uploads"
#define an admin and a guest
#change if needed.
myAdminLogin = "admin"
myAdminPwd = "adminpwd"
myAdminEmail = "admin@example.com"
myGuestLogin = "guest"
myGuestPwd = "guestpwd"
myGuestEmail = "guest@example.com"
#define Google TLD and languages and stopWords
#see https://developers.google.com/custom-search/docs/xml_results_appendices
#https://www.metamodpro.com/browser-language-codes
#Languages
dfTLDLanguages = pd.read_excel('configdata/tldLang.xlsx')
dfTLDLanguages.fillna('',inplace=True)
if len(dfTLDLanguages) == 0 :
#if 1==1 :
    #fallback values used when the Excel file is empty
    data = [['google.com', 'United States', 'com', 'en', 'lang_en', 'countryUS', 'en-us', 'english', 'United States', 'en - English'],
            ['google.co.uk', 'United Kingdom', 'co.uk', 'en', 'lang_en', 'countryUK', 'en-uk', 'english', 'United Kingdom', 'en - English'],
            ['google.de', 'Germany', 'de','de', 'lang_de', 'countryDE', 'de-de', 'german', 'Germany', 'de - German'],
            ['google.fr', 'France', 'fr', 'fr', 'lang_fr', 'countryFR', 'fr-fr', 'french', 'France', 'fr - French']]
    dfTLDLanguages = pd.DataFrame(data, columns = ['tldLang', 'description', 'tld', 'hl', 'lr', 'cr', 'userAgentLanguage', 'stopWords', 'countryName', 'ISOLanguage' ])
myTLDLang = [tuple(r) for r in dfTLDLanguages[['tldLang', 'description']].values]
#print(myTLDLang)
dfTLDLanguages = dfTLDLanguages.drop_duplicates()
dfTLDLanguages.set_index('tldLang', inplace=True)
dfTLDLanguages.info()
#dfTLDLanguages['userAgentLanguage']
#USER AGENTS
with open('configdata/user_agents-taglang.txt') as f:
    userAgentsList = f.read().splitlines()
#pauses to scrap Google
myLowPause=2
myHighPause=6
#define max pages to scrap on Google (30 is enough 100 max)
myMaxPagesToScrap=30
#define refresh delay (usually 30 days)
myRefreshDelay = 30
myMaxFeatures = 10000 #to calculate tf-idf
#Name of account type
myAccountTypeName=['Admin', 'Gold', 'Silver','Bronze', 'Guest',]
#max number of results to keep in TF-IDF depending on role
myMaxResults=[10000, 10000, 5000, 1000, 100]
#max searches by day depending on role
myMaxSearchesByDay=[10000, 10000, 1000, 100, 10]
#min ngram
myMinNGram=1 #Not a good idea to change this for the moment
#max ngram
myMaxNGram=6 ##Not a good idea to change this for the moment
#CSV separator
myCsvSep = ","
Important parameters:
- myDatabaseURI: you can change this variable if you want a MySQL database (mysql://scott:tiger@localhost/mydatabase) or a PostgreSQL one (postgresql://scott:tiger@localhost/mydatabase).
- myAdminLogin, myAdminPwd…: change these if you want different logins/passwords for the users (user management is planned for the future).
- myRefreshDelay: delay between 2 reads on Google for the same query. We chose 30 days because alternative position sources such as Yooda Insight or SEMrush are often updated monthly. Note: since we plan to use their APIs in the future, this gives a consistent delay across the different sources. It also makes the tool faster and avoids querying Google for nothing.
- myCsvSep: by default we use "," as the separator for the result files. If you have a French version of Excel you can choose ";" to be able to open these files.
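As an illustration of the separator setting, the sketch below writes a few rows with ";" as the delimiter. It uses Python's standard csv module rather than the application's own export code, and the column names and values are made up for the example.

```python
import csv
import io

my_csv_sep = ";"  # plays the role of myconfig.myCsvSep; ";" suits a French Excel

# Made-up rows: an expression and a score (hypothetical column names).
rows = [("expression", "tfidf"), ("audit seo", 0.42), ("outils seo", 0.37)]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=my_csv_sep, lineterminator="\n")
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Switching my_csv_sep back to "," gives the default comma-separated layout.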
tfidfkeywordssuggest.py: main program
Here we split the source code into several pieces to present it.
Loading the required libraries
# -*- coding: utf-8 -*-
"""
Created on Tue Jul 16 15:41:05 2019
@author: Pierre
"""
#############################################################
# Anakeyn Keywords Suggest version Alpha 0.0
# Anakeyn Keywords Suggest is a keywords suggestion tool.
# This tool searches in the first pages responding to a given keyword in Google. Next the
# system will get the content of the pages in order to find popular and original keywords
# in the subject area. The system works with a TF-IDF algorithm.
#############################################################
#Copyright 2019 Pierre Rouarch
# License GPL V3
#############################################################
#see also
#https://github.com/PrettyPrinted/building_user_login_system
#https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-i-hello-world
#https://github.com/MarioVilas/googlesearch #googlesearch serp scraper
############### FOR FLASK ###############################
#conda install -c anaconda flask
from flask import Flask, render_template, redirect, url_for, Response, send_file
#from flask import session #for the sessions variables
#from flask_session import Session #for sessions
#pip install flask-bootstrap #if not installed in a console
from flask_bootstrap import Bootstrap #to have a responsive design with flask
from flask_wtf import FlaskForm #forms
from wtforms import StringField, PasswordField, BooleanField, SelectField #field types
from wtforms.validators import InputRequired, Email, Length #field validators
from flask_sqlalchemy import SQLAlchemy
from werkzeug.security import generate_password_hash, check_password_hash
from flask_login import LoginManager, UserMixin, login_user, login_required, logout_user, current_user
import time
from datetime import datetime, date #, timedelta
############## For other Functionalities
import numpy as np #for vectors and arrays
import pandas as pd #for dataframes
#pip install google #to install the Google Search library by Mario Vilas
#https://python-googlesearch.readthedocs.io/en/latest/
import googlesearch #Scrap serps
#to randomize pause
import random
#
import nltk # for text mining
from nltk.corpus import stopwords
nltk.download('stopwords') #for stopwords
#print (stopwords.fileids())
#TF-IDF function
from sklearn.feature_extraction.text import TfidfVectorizer
import requests #to read urls contents
from bs4 import BeautifulSoup
from bs4.element import Comment
import re #for regex
import unicodedata #to decode accents
import os #for directories
import sys #for sys variables
I take this opportunity to thank Anthony Herbert and Miguel Grinberg for their excellent Flask tutorials, as well as Mario Vilas for his Google results scraper, which is very easy to use.
Flask initialization
##### Flask Environment
# Returns the directory the current script (or interpreter) is running in
def get_script_directory():
    path = os.path.realpath(sys.argv[0])
    if os.path.isdir(path):
        return path
    else:
        return os.path.dirname(path)
myScriptDirectory = get_script_directory()
#############################################################
# In a myconfig.py file, think to define the database server name
#############################################################
import myconfig #my configuration : edit this if needed
#print(myconfig.myDatabaseURI )
myDirectory =myScriptDirectory+myconfig.UPLOAD_SUBDIRECTORY
if not os.path.exists(myDirectory ):
    os.makedirs(myDirectory )
app = Flask(__name__) #flask application
app.config['SECRET_KEY'] = 'Thisissupposedtobesecret!' #what you want
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False #avoid a warning
app.config['SQLALCHEMY_DATABASE_URI'] =myconfig.myDatabaseURI #database choice
bootstrap = Bootstrap(app) #for bootstrap compatiblity
Creating the database, the tables and the required directories
############# #########################
# Database and Tables
#######################################
db = SQLAlchemy(app) #the current database attached to the app.
#users
class User(UserMixin, db.Model):
    __tablename__="user"
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(15), unique=True)
    email = db.Column(db.String(50), unique=True)
    password = db.Column(db.String(80))
    role = db.Column(db.Integer)
    def __init__(self, username, email, password, role):
        self.username = username
        self.email = email
        self.password = password
        self.role = role
#Global Queries / Expressions / Keywords searches
class Keyword(db.Model):
    __tablename__="keyword"
    id = db.Column(db.Integer, primary_key=True)
    keyword = db.Column(db.String(200))
    tldLang = db.Column(db.String(50)) #google tld + lang
    #date.today is a callable returning a date (datetime.date is not usable as a default here)
    data_date = db.Column(db.Date, nullable=False, default=date.today) #date of the last data update
    search_date = db.Column(db.Date, nullable=False, default=date.today) #date of the last search asks
    def __init__(self, keyword , tldLang, data_date, search_date):
        self.keyword = keyword
        self.tldLang = tldLang
        self.data_date = data_date
        self.search_date = search_date
#Queries / Expressions / Keywords searches by username
class KeywordUser(db.Model):
    __tablename__="keyworduser"
    id = db.Column(db.Integer, primary_key=True)
    keywordId = db.Column(db.Integer) #id in the keyword Table
    keyword = db.Column(db.String(200))
    tldLang = db.Column(db.String(50)) #google tld + lang
    username = db.Column(db.String(15)) #
    data_date = db.Column(db.Date, nullable=False, default=date.today) #date of the last data update
    search_date = db.Column(db.Date, nullable=False, default=date.today) #date of the last search asks
    def __init__(self, keywordId, keyword , tldLang, username, data_date , search_date):
        self.keywordId = keywordId
        self.keyword = keyword
        self.tldLang = tldLang
        self.username = username
        self.data_date = data_date
        self.search_date = search_date
#Positions
class Position(db.Model):
    __tablename__="position"
    id = db.Column(db.Integer, primary_key=True)
    keyword = db.Column(db.String(200))
    tldLang = db.Column(db.String(50)) #google tld + lang
    page = db.Column(db.String(300))
    position = db.Column(db.Integer)
    source= db.Column(db.String(20))
    search_date = db.Column(db.Date, nullable=False, default=date.today)
    def __init__(self, keyword , tldLang, page, position, source, search_date):
        self.keyword = keyword
        self.tldLang = tldLang
        self.page = page
        self.position = position
        self.source = source
        self.search_date = search_date
#Page content
class Page(db.Model):
    __tablename__="page"
    id = db.Column(db.Integer, primary_key=True)
    page = db.Column(db.String(300))
    statusCode= db.Column(db.Integer)
    html= db.Column(db.Text)
    encoding = db.Column(db.String(20))
    elapsedTime = db.Column(db.Float) #could be interesting for future purpose.
    body= db.Column(db.Text)
    search_date = db.Column(db.Date, nullable=False, default=date.today)
    def __init__(self, page , statusCode, html, encoding, elapsedTime, body, search_date):
        self.page = page
        self.statusCode = statusCode
        self.html = html
        self.encoding = encoding
        self.elapsedTime = elapsedTime
        self.body = body
        self.search_date = search_date
##############
db.create_all() #create database and tables if not exist
db.session.commit() #execute previous instruction
#Create an admin if not exists
exists = db.session.query(
    db.session.query(User).filter_by(username=myconfig.myAdminLogin).exists()
).scalar()
if not exists :
    hashed_password = generate_password_hash(myconfig.myAdminPwd, method='sha256')
    administrator = User(myconfig.myAdminLogin, myconfig.myAdminEmail, hashed_password, 0)
    db.session.add(administrator)
    db.session.commit() #execute
#####
#create upload/download directory for admin
myDirectory = myScriptDirectory+myconfig.UPLOAD_SUBDIRECTORY+"/"+myconfig.myAdminLogin
if not os.path.exists(myDirectory):
    os.makedirs(myDirectory)
#Create a guest if not exists
exists = db.session.query(
    db.session.query(User).filter_by(username=myconfig.myGuestLogin).exists()
).scalar()
if not exists :
    hashed_password = generate_password_hash(myconfig.myGuestPwd, method='sha256')
    guest = User(myconfig.myGuestLogin, myconfig.myGuestEmail, hashed_password, 4)
    db.session.add(guest)
    db.session.commit() #execute
#####
#create upload/download directory for guest
myDirectory = myScriptDirectory+myconfig.UPLOAD_SUBDIRECTORY+"/"+myconfig.myGuestLogin
if not os.path.exists(myDirectory):
    os.makedirs(myDirectory)
#init login_manager
login_manager = LoginManager()
login_manager.init_app(app)
login_manager.login_view = 'login'
Other initializations and useful functions
#######################################################
#Save session data in a global DataFrame depending on user_id
global dfSession
dfSession = pd.DataFrame(columns=['user_id', 'userName', 'role', 'keyword', 'tldLang', 'keywordId', 'keywordUserId'])
dfSession.set_index('user_id', inplace=True)
dfSession.info()
#for tfidf counts
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'value']
    return df
def top_mean_feats(Xtr, features, grp_ids=None, top_n=25):
    ''' Return the top n features that on average are most important amongst documents in rows
    identified by indices in grp_ids. '''
    if grp_ids:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()
    #D[D < min_tfidf] = 0 #keep all values
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)
#Best for original Keywords
def top_nonzero_mean_feats(Xtr, features, grp_ids=None, top_n=25):
    ''' Return the top n features that on nonzero average are most important amongst documents in rows
    identified by indices in grp_ids. '''
    if grp_ids:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()
    #D[D < min_tfidf] = 0
    tfidf_nonzero_means = np.nanmean(np.where(D!=0,D,np.nan), axis=0) #change 0 in NaN
    return top_tfidf_feats(tfidf_nonzero_means, features, top_n)
@login_manager.user_loader
def load_user(user_id):
    return User.query.get(int(user_id))
#Forms
class LoginForm(FlaskForm):
    username = StringField('username', validators=[InputRequired(), Length(min=4, max=15)])
    password = PasswordField('password', validators=[InputRequired(), Length(min=8, max=80)])
    remember = BooleanField('remember me')
#we don't use this for the moment
class RegisterForm(FlaskForm):
    email = StringField('email', validators=[InputRequired(), Email(message='Invalid email'), Length(max=50)])
    username = StringField('username', validators=[InputRequired(), Length(min=4, max=15)])
    password = PasswordField('password', validators=[InputRequired(), Length(min=8, max=80)])
    role = 4
#search keywords by keyword
class SearchForm(FlaskForm):
    myTLDLang = myconfig.myTLDLang
    keyword = StringField('keyword / Expression', validators=[InputRequired(), Length(max=200)])
    tldLang = SelectField('Country - Language', choices=myTLDLang, validators=[InputRequired()])
############### other functions
#####Get strings from tags
def getStringfromTag(tag="h1", soup="") :
    theTag = soup.find_all(tag)
    myTag = ""
    for x in theTag:
        myTag= myTag + " " + x.text.strip()
    return myTag.strip()
#remove comments and non visible tags
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
def strip_accents(text, encoding='utf-8'):
    """
    Strip accents from input String.
    :param text: The input string.
    :type text: String.
    :returns: The processed String.
    :rtype: String.
    """
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode(encoding)
    return str(text)
# Get a random user agent.
def getRandomUserAgent(userAgentsList, userAgentLanguage):
    theUserAgent = random.choice(userAgentsList)
    if len(userAgentLanguage) > 0 :
        theUserAgent = theUserAgent.replace("{{tagLang}}","; "+str(userAgentLanguage))
    else :
        theUserAgent = theUserAgent.replace("{{tagLang}}",""+str(userAgentLanguage))
    return theUserAgent
#ngrams in list
def words_to_ngrams(words, n, sep=" "):
    return [sep.join(words[i:i+n]) for i in range(len(words)-n+1)]
#one-gram tokenizer
tokenizer = nltk.RegexpTokenizer(r'\w+')
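To see what the accent stripping and the n-gram helper produce, here is a small self-contained sketch. It re-declares simplified versions of strip_accents and words_to_ngrams so that it runs on its own, and it uses a plain split() as a stand-in for the NLTK tokenizer. The sample sentence is made up.

```python
import unicodedata

def strip_accents(text):
    """Decompose accented characters (NFD) and drop the non-ASCII combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def words_to_ngrams(words, n, sep=" "):
    """Join every run of n consecutive words into one expression."""
    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Made-up sentence; split() stands in for the tokenizer used by the application.
words = strip_accents("stratégie de référencement naturel").split()

for n in (1, 2, 3):
    print(n, words_to_ngrams(words, n))
```

Asking for n-grams longer than the sentence simply returns an empty list, which is why short pages contribute few 5- or 6-word expressions.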
Basic routes
#################### WebSite ##################################
#Routes
@app.route('/')
def index():
    return render_template('index.html')
@app.route('/login', methods=['GET', 'POST'])
def login():
    form = LoginForm()
    if form.validate_on_submit():
        user = User.query.filter_by(username=form.username.data).first()
        if user:
            if check_password_hash(user.password, form.password.data):
                login_user(user, remember=form.remember.data)
                #endpoint name = name of the view function defined below
                return redirect(url_for('tfidfkeywordssuggest')) #go to the keywords Suggest page
            return '<h1>Invalid password</h1>'
        return '<h1>Invalid username</h1>'
        #return '<h1>' + form.username.data + ' ' + form.password.data + '</h1>'
    return render_template('login.html', form=form)
#Not used here.
@app.route('/signup', methods=['GET', 'POST'])
def signup():
    form = RegisterForm()
    if form.validate_on_submit():
        hashed_password = generate_password_hash(form.password.data, method='sha256')
        #User.__init__ requires a role: use the form default (4 = guest)
        new_user = User(username=form.username.data, email=form.email.data, password=hashed_password, role=form.role)
        db.session.add(new_user)
        db.session.commit()
        return '<h1>New user has been created!</h1>'
        #return '<h1>' + form.username.data + ' ' + form.email.data + ' ' + form.password.data + '</h1>'
    return render_template('signup.html', form=form)
#Not used here.
@app.route('/dashboard')
@login_required
def dashboard():
    return render_template('dashboard.html', name=current_user.username)
@app.route('/logout')
@login_required
def logout():
    logout_user()
    return redirect(url_for('index'))
The signup page and the global dashboard page are not used for now.
Note: Flask uses the concept of a "route". In web development, a route is a URL or a set of URLs that leads to the execution of a given function.
In Flask, routes are declared with the app.route decorator.
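The idea behind a route decorator can be sketched without Flask at all. The toy registry below only mimics the principle (a decorator that records which function answers which URL). ToyApp, route and dispatch are hypothetical names invented for this illustration and do not reflect Flask's actual implementation.

```python
# Toy illustration only: a tiny registry that mimics the *idea* of app.route.
class ToyApp:
    def __init__(self):
        self.routes = {}  # maps a URL path to the function that handles it

    def route(self, path):
        def decorator(func):
            self.routes[path] = func  # register the function under this URL
            return func               # leave the function itself unchanged
        return decorator

    def dispatch(self, path):
        # Look up the handler registered for this URL and call it.
        handler = self.routes.get(path)
        return handler() if handler else "404 Not Found"

app = ToyApp()

@app.route("/")
def index():
    return "home page"

@app.route("/logout")
def logout():
    return "logged out"

print(app.dispatch("/"))
print(app.dispatch("/missing"))
```

Flask adds much more on top of this (HTTP methods, URL parameters, request context), but the mapping from URL to function is the part that matters for reading the routes in this article.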
tfidfkeywordssuggest route: entering the expression and the country/language
#Route tfidfkeywordssuggest
@app.route('/tfidfkeywordssuggest',methods=['GET', 'POST'])
@login_required
def tfidfkeywordssuggest():
    print("tfidfkeywordssuggest")
    #print("g.userId="+str(g.userId))
    if current_user.is_authenticated: #always because @login_required
        print("UserId= "+str(current_user.get_id()))
        myUserId = current_user.get_id()
        print("UserName = "+current_user.username)
        dfSession.loc[ myUserId, 'userName'] = current_user.username #Save Username for userId
        #make sure we have a good Role
        if current_user.role is None or current_user.role > 4 or current_user.role <0 :
            dfSession.loc[ myUserId,'role'] = 4 #4 is for guest
        else :
            dfSession.loc[ myUserId,'role'] = current_user.role
        myRole = dfSession.loc[ myUserId,'role']
        #count searches in a day
        myLimitReached = False
        mySearchesCount = db.session.query(KeywordUser).filter_by(username=current_user.username, search_date=date.today()).count()
        print("mySearchesCount="+str(mySearchesCount))
        print(" myconfig.myMaxSearchesByDay[myRole]="+str(myconfig.myMaxSearchesByDay[myRole]))
        if (mySearchesCount >= myconfig.myMaxSearchesByDay[myRole]):
            myLimitReached=True
        #reset values
        dfSession.loc[myUserId,'keyword'] = ""
        dfSession.loc[myUserId,'tldLang'] =""
        form = SearchForm()
        if form.validate_on_submit():
            dfSession.loc[myUserId,'keyword'] = form.keyword.data #save in session variable
            dfSession.loc[myUserId,'tldLang'] = form.tldLang.data #save in session variable
        dfSession.head()
    return render_template('tfidfkeywordssuggest.html', name=current_user.username, form=form,
                           keyword = form.keyword.data , tldLang = form.tldLang.data,
                           role =myRole, MaxResults=myconfig.myMaxResults[myRole], limitReached=myLimitReached)
This route handles the input form; it also checks whether you are allowed to run a search.
PROGRESS route
The "progress" route performs the actual searches. It also talks to the client (the tfidfkeywordssuggest.html page) by sending it information through the generate function, a generator that yields progress values in a loop.
Since this part is relatively long, we will go through it in several steps.
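The messages sent to the page follow the Server-Sent Events text format: a "data:" line followed by a blank line. The sketch below isolates that mechanism in plain Python. sse_event and the simplified generate are hypothetical helpers written for this illustration, not the application's code.

```python
def sse_event(value):
    """Format one value as a Server-Sent Events message ('data:' line plus a blank line)."""
    return "data:" + str(value) + "\n\n"

def generate(steps):
    """Hypothetical stand-in for the long-running work: yield one progress percentage per step."""
    for done in range(1, steps + 1):
        yield sse_event(int(round(done * 100 / steps)))

events = list(generate(4))
print(events)
```

In a Flask route, such a generator is typically wrapped in a Response with the text/event-stream MIME type so that the browser receives each value as soon as it is yielded.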
Initialization and retrieval of the session data
@app.route('/progress')
def progress():
    print("progress")
    if current_user.is_authenticated:
        myUserId = current_user.get_id()
        print("myUserId="+str(myUserId))
    dfScrap = pd.DataFrame(columns=['keyword', 'tldLang', 'page', 'position', 'source', 'search_date'])
    def generate(dfScrap, myUserId ):
        myUserName=dfSession.loc[ myUserId,'userName']
        myRole = dfSession.loc[ myUserId,'role']
        myKeyword = dfSession.loc[ myUserId,'keyword']
        myTLDLang = dfSession.loc[myUserId,'tldLang']
        myKeywordId = dfSession.loc[ myUserId,'keywordId']
        print("myUserId : "+myUserId)
        print("myUserName : "+myUserName)
        print("myRole : "+str(myRole))
        print("myKeyword : "+myKeyword)
        print("myTLDLang : "+myTLDLang)
        print("myKeywordId : "+str(myKeywordId))
        mySearchesCount = db.session.query(KeywordUser).filter_by(username=myUserName, search_date=date.today()).count()
        print("mySearchesCount="+str(mySearchesCount))
        print(" myconfig.myMaxSearchesByDay[myRole]="+str(myconfig.myMaxSearchesByDay[myRole]))
        if (mySearchesCount >= myconfig.myMaxSearchesByDay[myRole] ):
            myKeyword=""
            myTLDLang=""
            myShow=-1 #Error
            yield "data:" + str(myShow) + "\n\n" #to show error
Checking whether an identical search was already run within the last 30 days
        ##############################
        #run
        ###############################
        if ( len(myKeyword) > 0 and len(myTLDLang) >0) :
            myDate=date.today()
            print('myDate='+str(myDate))
            goSearch=False #do we make a new search in Google ?
            #did anybody already make this search during the last x days ?
            firstKWC = db.session.query(Keyword).filter_by(keyword=myKeyword, tldLang=myTLDLang).first()
            #lastSearchDate = Keyword.query.filter(keyword==myKeyword, tldLang==myTLDLang ).first().format('search_date')
            if firstKWC is None:
                goSearch=True
            else:
                myKeywordId=firstKWC.id
                dfSession.loc[ myUserId,'keywordId']=myKeywordId #Save in the dfSession
                print("last Search Date="+ str(firstKWC.search_date))
                Delta = myDate - firstKWC.search_date
                print(" Delta in days="+str(Delta.days))
                if Delta.days > myconfig.myRefreshDelay : #30 by default
                    goSearch=True
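The freshness test above boils down to a date difference compared with the refresh delay. Here is a minimal standalone sketch of that rule. needs_new_search and REFRESH_DELAY_DAYS are hypothetical names chosen for the illustration, and the dates are arbitrary.

```python
from datetime import date, timedelta

REFRESH_DELAY_DAYS = 30  # plays the role of myconfig.myRefreshDelay

def needs_new_search(last_search_date, today, refresh_delay=REFRESH_DELAY_DAYS):
    """Return True when no previous search exists or the last one is older than the delay."""
    if last_search_date is None:  # nobody has run this query yet
        return True
    return (today - last_search_date).days > refresh_delay

today = date(2019, 7, 16)
print(needs_new_search(None, today))
print(needs_new_search(today - timedelta(days=30), today))
print(needs_new_search(today - timedelta(days=31), today))
```

With a strict "greater than" comparison, a search that is exactly 30 days old is still considered fresh; Google is queried again only from the 31st day.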
Scraping Google and saving to the positions table of the database
###############################################
# Search in Google and scrap Urls
###############################################
if ( len(myKeyword) > 0 and len(myTLDLang) >0 and goSearch) :
#get default language for tld
myTLD = myconfig.dfTLDLanguages.loc[myTLDLang, 'tld']
#myTLD=myTLD.strip()
print("myTLD="+myTLD+"!")
myHl = str(myconfig.dfTLDLanguages.loc[myTLDLang, 'hl'])
#myHl=myHl.strip()
print("myHl="+myHl+"!")
myLanguageResults = str(myconfig.dfTLDLanguages.loc[myTLDLang, 'lr'])
#myLanguageResults=myLanguageResults.strip()
print("myLanguageResults="+myLanguageResults+"!")
myCountryResults = str(myconfig.dfTLDLanguages.loc[myTLDLang, 'cr'])
#myCountryResults=myCountryResults.strip()
print("myCountryResults="+myCountryResults+"!")
myUserAgentLanguage = str(myconfig.dfTLDLanguages.loc[myTLDLang, 'userAgentLanguage'])
#myCountryResults=myCountryResults.strip()
print("myUserAgentLanguage="+myUserAgentLanguage+"!")
###############################
# Google Scrap
###############################
myNum=10
myStart=0
myStop=10 #get by ten
myMaxStart=myconfig.myMaxPagesToScrap #only 3 for test 10 in production
#myTbs= "qdr:m" #rsearch only last month not used
#tbs=myTbs,
#pause may be long to avoir blocking from Google
myLowPause=myconfig.myLowPause
myHighPause=myconfig.myHighPause
nbTrials = 0
#this may be long
while myStart < myMaxStart:
myShow= int(round(((myStart*50)/myMaxStart)+1)) #for the progress bar
yield "data:" + str(myShow) + "\n\n"
print("PASSAGE NUMBER :"+str(myStart))
print("Query:"+myKeyword)
#change user-agent and pause to avoid blocking by Google
myPause = random.randint(myLowPause,myHighPause) #long pause
print("Pause:"+str(myPause))
#change user_agent and provide local language in the User Agent
myUserAgent = getRandomUserAgent(myconfig.userAgentsList, myUserAgentLanguage)
#myUserAgent = googlesearch.get_random_user_agent()
print("UserAgent:"+str(myUserAgent))
df = pd.DataFrame(columns=['query', 'page', 'position', 'source']) #working dataframe
myPause=myPause*(nbTrials+1) #up the pause if trial get nothing
print("Pause:"+str(myPause))
try :
urls = googlesearch.search(query=myKeyword, tld=myTLD, lang=myHl, safe='off',
num=myNum, start=myStart, stop=myStop, domains=None, pause=myPause,
country=myCountryResults, extra_params={'lr': myLanguageResults}, tpe='', user_agent=myUserAgent)
df = pd.DataFrame(columns=['keyword', 'tldLang', 'page', 'position', 'source', 'search_date'])
for url in urls :
print("URL:"+url)
df.loc[df.shape[0],'page'] = url
df['keyword'] = myKeyword #fill with current keyword
df['tldLang'] = myTLDLang #fill with current country / tld lang
df['position'] = df.index.values + 1 + myStart #position = index +1 + myStart
df['source'] = "Scrap" #fill with source origin here scraping Google
#other potentials options : Semrush, Yooda Insight...
df['search_date'] = myDate
dfScrap = pd.concat([dfScrap, df], ignore_index=True) #concat scraps
# time.sleep(myPause) #add another pause
if (df.shape[0] > 0):
nbTrials = 0
myStart += 10
else :
nbTrials +=1
if (nbTrials > 3) :
nbTrials = 0
myStart += 10
#myStop += 10
except :
exc_type, exc_value, exc_traceback = sys.exc_info()
print("GOOGLE ERROR")
print(exc_type.__name__)
print(exc_value)
print(exc_traceback)
time.sleep(600) #add a big pause if you get an error.
#/while myStart < myMaxStart:
#dfScrap.info()
dfScrapUnique=dfScrap.drop_duplicates() #remove duplicates
#dfScrapUnique.info()
#Save in csv and json if needed
#dfScrapUnique.to_csv("dfScrapUnique.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
#dfScrapUnique.to_json("dfScrapUnique.json")
#Bulk save in position table
#save dataframe in table Position
dfScrapUnique.to_sql('position', con=db.engine, if_exists='append', index=False)
myShow=50
yield "data:" + str(myShow) + "\n\n" #to show 50 %
#/end search in Google
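The control flow of this loop (ten results per pass, a random pause that grows after each empty pass, and a move to the next page after too many empty passes) may be easier to follow without the Google-specific details. The sketch below makes several assumptions: `fetch_page` is a hypothetical callable standing in for `googlesearch.search`, the pause is applied by an injected `sleep` function instead of being handed to the search library, and the 600-second pause on exceptions is left out.

```python
import random
from typing import Callable, List


def scrape_result_pages(fetch_page: Callable[[int], List[str]],
                        max_start: int,
                        low_pause: int,
                        high_pause: int,
                        sleep: Callable[[float], None],
                        max_empty_trials: int = 3,
                        page_size: int = 10) -> List[str]:
    """Collect URLs page by page, pausing longer after each empty pass."""
    urls: List[str] = []
    start = 0
    empty_trials = 0
    while start < max_start:
        # the pause is multiplied by the number of failed attempts + 1
        pause = random.randint(low_pause, high_pause) * (empty_trials + 1)
        sleep(pause)
        page_urls = fetch_page(start)
        if page_urls:
            urls.extend(page_urls)
            empty_trials = 0
            start += page_size
        else:
            empty_trials += 1
            if empty_trials > max_empty_trials:
                # give up on this page and move on
                empty_trials = 0
                start += page_size
    return urls


calls = []
pauses = []


def fake_fetch(start: int) -> List[str]:
    """Hypothetical fetcher: the second page is empty on its first attempt."""
    calls.append(start)
    if start == 10 and calls.count(10) == 1:
        return []
    return [f"https://example.com/{start + i}" for i in range(10)]


found = scrape_result_pages(fake_fetch, max_start=30, low_pause=5,
                            high_pause=5, sleep=pauses.append)
print(len(found), calls, pauses)
```

Injecting `sleep` keeps the example instantaneous; passing `time.sleep` instead would produce real pauses.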
Retrieving the content of the web pages and saving it in the pages table.
###############################
# Go to get data from urls
###############################
#read urls to crawl in Position table
dfUrls = pd.read_sql_query(db.session.query(Position).filter_by(keyword= myKeyword, tldLang= myTLDLang).statement, con=db.engine)
dfUrls.info()
###### filter extensions
extensionsToCheck = ('.7z','.aac','.au','.avi','.bmp','.bzip','.css','.doc',
'.docx','.flv','.gif','.gz','.gzip','.ico','.jpg','.jpeg',
'.js','.mov','.mp3','.mp4','.mpeg','.mpg','.odb','.odf',
'.odg','.odp','.ods','.odt','.pdf','.png','.ppt','.pptx',
'.psd','.rar','.swf','.tar','.tgz','.txt','.wav','.wmv',
'.xls','.xlsx','.xml','.z','.zip')
indexGoodFile= dfUrls ['page'].apply(lambda x : not x.endswith(extensionsToCheck) )
dfUrls2=dfUrls.iloc[indexGoodFile.values]
dfUrls2.reset_index(inplace=True, drop=True)
dfUrls2.info()
#######################################################
# Scrap Urls only one time
########################################################
myPagesToScrap = dfUrls2['page'].unique()
dfPagesToScrap= pd.DataFrame(myPagesToScrap, columns=["page"])
#dfPagesToScrap.size #9
#add new variables
dfPagesToScrap['statusCode'] = np.nan
dfPagesToScrap['html'] = '' #
dfPagesToScrap['encoding'] = '' #
dfPagesToScrap['elapsedTime'] = np.nan
myShow=60
yield "data:" + str(myShow) + "\n\n" #to show 60%
stepShow = 10/max(len(dfPagesToScrap), 1) #avoid a division by zero when there is no page to scrap
print("stepShow scrap urls="+str(stepShow ))
for i in range(0,len(dfPagesToScrap)) :
url = dfPagesToScrap.loc[i, 'page']
print("Page i = "+url+" "+str(i))
startTime = time.time()
try:
#html = urllib.request.urlopen(url).read()
r = requests.get(url,timeout=(5, 14)) #request
dfPagesToScrap.loc[i,'statusCode'] = r.status_code
print('Status_code '+str(dfPagesToScrap.loc[i,'statusCode']))
if r.status_code == 200 :
print("Encoding="+str(r.encoding))
dfPagesToScrap.loc[i,'encoding'] = r.encoding
if r.encoding == 'UTF-7' : #skip UTF-7 content: it causes problems with the database
dfPagesToScrap.loc[i, 'html'] =""
print("UTF-7 page skipped ")
else :
dfPagesToScrap.loc[i, 'html'] = r.text
#as text (r.text), not as bytes (r.content)
print("ok page ")
#print(dfPagesToScrap.loc[i, 'html'] )
except:
print("Error page requests ")
endTime= time.time()
dfPagesToScrap.loc[i, 'elapsedTime'] = endTime - startTime
print('step scrap URL='+str(round((stepShow*i))))
myShow=60+round((stepShow*i))
yield "data:" + str(myShow) + "\n\n" #to show 60% ++
#/
dfPagesToScrap.info()
#merge dfUrls2, dfPagesToScrap -> dfUrls3
dfUrls3 = pd.merge(dfUrls2, dfPagesToScrap, on='page', how='left')
#keep only status code = 200
dfUrls3 = dfUrls3.loc[dfUrls3['statusCode'] == 200]
#dfUrls3 = dfUrls3.loc[dfUrls3['encoding'] != 'UTF-7'] #can't save utf-7 content in db ????
dfUrls3 = dfUrls3.loc[dfUrls3['html'] != ""] #don't get empty html
dfUrls3.reset_index(inplace=True, drop=True)
dfUrls3.info() #
dfUrls3 = dfUrls3.dropna() #remove rows with at least one na
dfUrls3.reset_index(inplace=True, drop=True)
dfUrls3.info() #
myShow=70
yield "data:" + str(myShow) + "\n\n" #to show 70%
#Get Body contents from html
dfUrls3['body'] = "" #Empty String
stepShow = 10/max(len(dfUrls3), 1) #avoid a division by zero when no page was kept
for i in range(0,len(dfUrls3)) :
print("Page keyword tldLang i = "+ dfUrls3.loc[i, 'page']+" "+ dfUrls3.loc[i, 'keyword']+" "+ dfUrls3.loc[i, 'tldLang']+" "+str(i))
encoding = dfUrls3.loc[i, 'encoding'] #get previously
print("get body content encoding"+encoding)
try:
soup = BeautifulSoup( dfUrls3.loc[i, 'html'], 'html.parser')
except :
soup=""
if len(soup) != 0 :
#Body Content
texts = soup.findAll(text=True)
visible_texts = filter(tag_visible, texts)
myBody = " ".join(t.strip() for t in visible_texts)
myBody=myBody.strip()
#myBody = strip_accents(myBody, encoding).lower() #think to do a global clean instead
myBody=" ".join(myBody.split()) #remove multiple spaces
#print(myBody)
dfUrls3.loc[i, 'body'] = myBody
print('step body content='+str(round((stepShow*i))))
myShow=70+round((stepShow*i))
yield "data:" + str(myShow) + "\n\n" #to show 70% ++
################################
#save pages in database table page
dfPages= dfUrls3[['page', 'statusCode', 'html', 'encoding', 'elapsedTime', 'body', 'search_date']]
dfPagesUnique = dfPages.drop_duplicates(subset='page') #remove duplicate's pages
dfPagesUnique = dfPagesUnique.dropna() #remove na
dfPagesUnique.reset_index(inplace=True, drop=True) #reset index
#dfPagesUnique.to_sql('page', con=db.engine, if_exists='append', index=False) #duplicate risks !!
#save to see what we get
#dfPagesUnique.to_csv("dfPagesUnique.csv", sep=myconfig.myCsvSep , encoding='utf-8', index=False)
#dfPagesUnique.to_json("dfPagesUnique.json")
myShow=80
yield "data:" + str(myShow) + "\n\n" #to show 80%
#insert or update in Page table
print("len df="+str( len(dfPagesUnique)))
stepShow = 10/max(len(dfPagesUnique), 1) #avoid a division by zero when there is no page to save
for i in range(0, len(dfPagesUnique)) :
print("i="+str(i))
print("page = "+dfPagesUnique.loc[i, 'page'])
dbPage = db.session.query(Page).filter_by(page=dfPagesUnique.loc[i, 'page']).first()
if dbPage is None :
print("nothing insert index = "+str(i))
newPage = Page(page=dfPagesUnique.loc[i, 'page'],
statusCode=dfPagesUnique.loc[i, 'statusCode'],
html=dfPagesUnique.loc[i, 'html'],
encoding=dfPagesUnique.loc[i, 'encoding'],
elapsedTime=dfPagesUnique.loc[i, 'elapsedTime'],
body=dfPagesUnique.loc[i, 'body'],
search_date=dfPagesUnique.loc[i, 'search_date'])
db.session.add(newPage)
db.session.commit()
else :
print("exists update id = "+str(dbPage.id))
#update values
dbPage.page=dfPagesUnique.loc[i, 'page']
dbPage.statusCode=dfPagesUnique.loc[i, 'statusCode']
dbPage.html=dfPagesUnique.loc[i, 'html']
dbPage.encoding=dfPagesUnique.loc[i, 'encoding']
dbPage.elapsedTime=dfPagesUnique.loc[i, 'elapsedTime']
dbPage.body=dfPagesUnique.loc[i, 'body'] #no trailing comma here, it would store a tuple
dbPage.search_date=dfPagesUnique.loc[i, 'search_date']
db.session.commit()
myShow=80+round((stepShow*i))
yield "data:" + str(myShow) + "\n\n" #to show 80% ++
###End Google search and scrap content page
myShow=90
yield "data:" + str(myShow) + "\n\n" #to show 90%
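The "insert or update" step on the page table follows a common pattern: look the row up by its URL, insert it if it is missing, update it otherwise. Here is a minimal, self-contained sketch of that pattern. It swaps the application's Flask-SQLAlchemy session for the standard `sqlite3` module and uses a reduced, hypothetical table layout, so it illustrates the logic only.

```python
import sqlite3


def upsert_page(conn: sqlite3.Connection, page: str,
                status_code: int, body: str) -> None:
    """Insert the page if its URL is unknown, otherwise update the stored row."""
    row = conn.execute("SELECT id FROM page WHERE page = ?", (page,)).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO page (page, status_code, body) VALUES (?, ?, ?)",
            (page, status_code, body),
        )
    else:
        conn.execute(
            "UPDATE page SET status_code = ?, body = ? WHERE id = ?",
            (status_code, body, row[0]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
# hypothetical, simplified schema (the real Page model has more columns)
conn.execute(
    "CREATE TABLE page (id INTEGER PRIMARY KEY, page TEXT UNIQUE, "
    "status_code INTEGER, body TEXT)"
)
upsert_page(conn, "https://example.com/", 200, "first version")
upsert_page(conn, "https://example.com/", 200, "second version")
rows = conn.execute("SELECT page, body FROM page").fetchall()
print(rows)
```

Saving the same URL twice leaves a single row holding the latest content, which is the reason the bulk `to_sql(..., if_exists='append')` call is commented out in the listing above.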
Saving in the keywords table and in the keywords-per-user table
###################################
#update keyword and keyworduser
###################################
#need to get firstKWC in keyword table before
firstKWC = db.session.query(Keyword).filter_by(keyword=myKeyword, tldLang=myTLDLang).first()
#Did we just run a new Google Scrap and Page Scrap ?
if goSearch :
myDataDate = myDate #Today
else :
if firstKWC is None :
myDataDate = myDate #Today
else :
myDataDate = firstKWC.data_date #old data date
#did somebody already run this search before ?
if firstKWC is None :
#insert
newKeyword = Keyword(keyword= myKeyword, tldLang=myTLDLang , data_date=myDataDate, search_date=myDate)
db.session.add(newKeyword)
db.session.commit()
db.session.refresh(newKeyword)
db.session.commit()
myKeywordId = newKeyword.id #
else :
myKeywordId = firstKWC.id
#update
firstKWC.data_date=myDataDate
firstKWC.search_date=myDate
db.session.commit()
myShow=91
yield "data:" + str(myShow) + "\n\n" #to show 91%
#for KeywordUser
#Did this user already run this search ?
print(" myKeywordId="+str(myKeywordId))
dfSession.loc[ myUserId,'keywordId']=myKeywordId
dbKeywordUser = db.session.query(KeywordUser).filter_by(keyword= myKeyword, tldLang=myTLDLang, username=myUserName).first()
if dbKeywordUser is None :
print("insert index new Keyword for = "+ myUserName)
newKeywordUser = KeywordUser(keywordId= myKeywordId, keyword= myKeyword,
tldLang=myTLDLang , username= myUserName, data_date=myDataDate, search_date=myDate)
db.session.add(newKeywordUser)
db.session.commit()
myKeywordUserId=newKeywordUser.id
else :
myKeywordUserId=dbKeywordUser.id #for the name
print("exists update only myDataDate" )
#update values for the current user
dbKeywordUser.data_date=myDataDate
dbKeywordUser.search_date=myDate
db.session.commit()
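As the code reads, two dates are stored for each keyword: `search_date` is always set to today, while `data_date` only changes when a new scrap has just been done. The choice of `data_date` can be summarised by the following sketch, in which the function name and its parameters are hypothetical.

```python
from datetime import date
from typing import Optional


def resolve_data_date(go_search: bool, stored_data_date: Optional[date],
                      today: date) -> date:
    """Pick the date that describes how old the scraped data is."""
    if go_search or stored_data_date is None:
        # fresh scrap, or keyword never stored before: the data is from today
        return today
    # cached data is reused: keep the date of the original scrap
    return stored_data_date


today = date(2019, 6, 30)
old = date(2019, 6, 1)
print(resolve_data_date(True, old, today))
print(resolve_data_date(False, old, today))
print(resolve_data_date(False, None, today))
```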
Creating the keyword files computed with TF-IDF (and end of /progress)
######################################################
####################### tf-idf files generation
dfSession.loc[ myUserId,'keywordUserId']=myKeywordUserId
#Make sure download directory exists
myDirectory = myScriptDirectory+myconfig.UPLOAD_SUBDIRECTORY+"/"+myUserName
if not os.path.exists(myDirectory):
os.makedirs(myDirectory)
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myShow=92
yield "data:" + str(myShow) + "\n\n" #to show 92%
#Read in position table to get the pages list in dataframe
dfPagesUnique = pd.read_sql_query(db.session.query(Position, Page).filter_by(keyword= myKeyword, tldLang= myTLDLang).filter(Position.page==Page.page).statement, con=db.engine)
dfPagesUnique.info()
#Remove apostrophes and quotes
print("Remove apostrophes and quotes")
stopQBody = dfPagesUnique['body'].apply(lambda x: x.replace("\"", " "))
stopAQBody =stopQBody.apply(lambda x: x.replace("'", " "))
#Remove english stopwords
print("Remove English stopwords")
stopEnglish = stopwords.words('english')
stopEnglishBody = stopAQBody.apply(lambda x: ' '.join([word for word in x.split() if word not in ( stopEnglish)]))
#Get the good local stopwords
stopLocalLanguage = myconfig.dfTLDLanguages.loc[myTLDLang, 'stopWords']
if (stopLocalLanguage in stopwords.fileids()) :
print(" stopLocalLanguage="+ stopLocalLanguage)
stopLocal = stopwords.words(stopLocalLanguage)
print("Remove local Stop Words")
stopLocalBody = stopEnglishBody.apply(lambda x: ' '.join([word for word in x.split() if word not in (stopLocal)]))
else :
stopLocalBody = stopEnglishBody
print("Remove Special Characters")
stopSCBody = stopLocalBody.apply(lambda x: re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", " ", x))
print("Remove Numbers")
#remove numbers
stopNumbersBody = stopSCBody.apply(lambda x: ''.join(i for i in x if not i.isdigit()))
print("Remove Multiple Spaces")
stopSpacesBody = stopNumbersBody.apply(lambda x: re.sub(" +", " ", x))
#print("Encode in UTF-8")
#stopEncodeBody = stopSpacesBody.apply(lambda x: x.encode('utf-8', 'ignore'))
stopEncodeBody= stopSpacesBody #already in utf-8
#create "clean" Corpus
corpus = stopEncodeBody.tolist()
print('corpus Size='+str(len(corpus)))
myMaxFeatures = myconfig.myMaxFeatures
myMaxResults = myconfig.myMaxResults
print("Popular Expressions")
print("Mean for min to max words")
tf_idf_vectMinMax = TfidfVectorizer(ngram_range=(myconfig.myMinNGram,myconfig.myMaxNGram), max_features=myMaxFeatures) # , norm=None
XtrMinMax = tf_idf_vectMinMax.fit_transform(corpus)
featuresMinMax = tf_idf_vectMinMax.get_feature_names()
myTopNMinMax=min(len(featuresMinMax), myMaxResults[myRole])
dfTopMinMax = top_mean_feats(Xtr=XtrMinMax, features=featuresMinMax, grp_ids=None, top_n= myTopNMinMax)
dfTopMinMax.to_csv(myDirectory+"/pop-"+myKeywordFileNameString+"-min-max.csv", sep=myconfig.myCsvSep , encoding='utf-8', index=False)
myShow=93
yield "data:" + str(myShow) + "\n\n" #to show 93%
print("for 1 word")
#Keywords suggestion
# for 1 word
tf_idf_vect1 = TfidfVectorizer(ngram_range=(1,1), max_features=myMaxFeatures) # , norm=None
Xtr1 = tf_idf_vect1.fit_transform(corpus)
features1 = tf_idf_vect1.get_feature_names()
myTopN1=min(len(features1), myMaxResults[myRole])
dfTop1 = top_mean_feats(Xtr=Xtr1, features=features1, grp_ids=None, top_n=myTopN1)
dfTop1.to_csv(myDirectory+"/pop-"+myKeywordFileNameString+"-1.csv", sep=myconfig.myCsvSep , encoding='utf-8', index=False)
#for 2
print("for 2 words")
tf_idf_vect2 = TfidfVectorizer(ngram_range=(2,2), max_features=myMaxFeatures) # , norm=None
Xtr2 = tf_idf_vect2.fit_transform(corpus)
features2 = tf_idf_vect2.get_feature_names()
myTopN2=min(len(features2), myMaxResults[myRole])
dfTop2 = top_mean_feats(Xtr=Xtr2, features=features2, grp_ids=None, top_n=myTopN2)
dfTop2.to_csv(myDirectory+"/pop-"+myKeywordFileNameString+"-2.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
#for 3
print("for 3 words")
tf_idf_vect3 = TfidfVectorizer(ngram_range=(3,3), max_features=myMaxFeatures) # , norm=None
Xtr3 = tf_idf_vect3.fit_transform(corpus)
features3 = tf_idf_vect3.get_feature_names()
myTopN3=min(len(features3), myMaxResults[myRole])
dfTop3 = top_mean_feats(Xtr=Xtr3, features=features3, grp_ids=None, top_n=myTopN3)
dfTop3.to_csv(myDirectory+"/pop-"+myKeywordFileNameString+"-3.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
myShow=94
yield "data:" + str(myShow) + "\n\n" #to show 94%
#for 4
print("for 4 words")
tf_idf_vect4 = TfidfVectorizer(ngram_range=(4,4), max_features=myMaxFeatures) # , norm=None
Xtr4 = tf_idf_vect4.fit_transform(corpus)
features4 = tf_idf_vect4.get_feature_names()
myTopN4=min(len(features4), myMaxResults[myRole])
dfTop4 = top_mean_feats(Xtr=Xtr4, features=features4, grp_ids=None, top_n=myTopN4)
dfTop4.to_csv(myDirectory+"/pop-"+myKeywordFileNameString+"-4.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
#for 5
print("for 5 words")
tf_idf_vect5 = TfidfVectorizer(ngram_range=(5,5), max_features=myMaxFeatures) # , norm=None
Xtr5 = tf_idf_vect5.fit_transform(corpus)
features5 = tf_idf_vect5.get_feature_names()
myTopN5=min(len(features5), myMaxResults[myRole])
dfTop5 = top_mean_feats(Xtr=Xtr5, features=features5, grp_ids=None, top_n=myTopN5)
dfTop5.to_csv(myDirectory+"/pop-"+myKeywordFileNameString+"-5.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
#for 6
print("for 6 words")
tf_idf_vect6 = TfidfVectorizer(ngram_range=(6,6), max_features=myMaxFeatures) # , norm=None
Xtr6 = tf_idf_vect6.fit_transform(corpus)
features6 = tf_idf_vect6.get_feature_names()
myTopN6=min(len(features6), myMaxResults[myRole])
dfTop6 = top_mean_feats(Xtr=Xtr6, features=features6, grp_ids=None, top_n=myTopN6)
dfTop6.to_csv(myDirectory+"/pop-"+myKeywordFileNameString+"-6.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
myShow=95
yield "data:" + str(myShow) + "\n\n" #to show 95%
print("Original Expressions")
print("NZ Mean for min to max words")
dfTopNZMinMax = top_nonzero_mean_feats(Xtr=XtrMinMax, features=featuresMinMax, grp_ids=None, top_n=myTopNMinMax)
dfTopNZMinMax.to_csv(myDirectory+"/ori-"+myKeywordFileNameString+"-min-max.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
myShow=96
yield "data:" + str(myShow) + "\n\n" #to show 96%
#for 1
print("NZ for 1 word")
dfTopNZ1 = top_nonzero_mean_feats(Xtr=Xtr1, features=features1, grp_ids=None, top_n=myTopN1)
dfTopNZ1.to_csv(myDirectory+"/ori-"+myKeywordFileNameString+"-1.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
#for 2
print("NZ for 2 words")
dfTopNZ2 = top_nonzero_mean_feats(Xtr=Xtr2, features=features2, grp_ids=None, top_n=myTopN2)
dfTopNZ2.to_csv(myDirectory+"/ori-"+myKeywordFileNameString+"-2.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
myShow=97
yield "data:" + str(myShow) + "\n\n" #to show 97%
#for 3
print("NZ for 3 words")
dfTopNZ3 = top_nonzero_mean_feats(Xtr=Xtr3, features=features3, grp_ids=None, top_n=myTopN3)
dfTopNZ3.to_csv(myDirectory+"/ori-"+myKeywordFileNameString+"-3.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
# for 4
print("NZ for 4 words")
dfTopNZ4 = top_nonzero_mean_feats(Xtr=Xtr4, features=features4, grp_ids=None, top_n=myTopN4)
dfTopNZ4.to_csv(myDirectory+"/ori-"+myKeywordFileNameString+"-4.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
#for 5
print("NZ for 5 words")
dfTopNZ5 = top_nonzero_mean_feats(Xtr=Xtr5, features=features5, grp_ids=None, top_n=myTopN5)
dfTopNZ5.to_csv(myDirectory+"/ori-"+myKeywordFileNameString+"-5.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
myShow=99
yield "data:" + str(myShow) + "\n\n" #to show 99%
#for 6
print("NZ for 6 words")
dfTopNZ6 = top_nonzero_mean_feats(Xtr=Xtr6, features=features6, grp_ids=None, top_n=myTopN6)
dfTopNZ6.to_csv(myDirectory+"/ori-"+myKeywordFileNameString+"-6.csv", sep=myconfig.myCsvSep, encoding='utf-8', index=False)
#Finish
myShow=100
yield "data:" + str(myShow) + "\n\n" #to show 100% and close
#/run
#loop generate
return Response(generate(dfScrap, myUserId), mimetype='text/event-stream')
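The difference between the "popular" files (mean TF-IDF) and the "original" files (mean of the non-zero TF-IDF values) can be shown on a tiny matrix. The sketch below uses plain Python lists and invented scores; it does not reproduce `top_mean_feats` or `top_nonzero_mean_feats`, which are defined elsewhere in the application, and the function names are hypothetical.

```python
from typing import Dict, List


def mean_scores(matrix: List[List[float]], features: List[str]) -> Dict[str, float]:
    """Mean TF-IDF per expression over all documents (zeros included)."""
    n_docs = len(matrix)
    return {f: sum(row[j] for row in matrix) / n_docs
            for j, f in enumerate(features)}


def nonzero_mean_scores(matrix: List[List[float]], features: List[str]) -> Dict[str, float]:
    """Mean TF-IDF per expression over the documents that contain it."""
    scores = {}
    for j, f in enumerate(features):
        nonzero = [row[j] for row in matrix if row[j] != 0]
        scores[f] = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return scores


# invented values: one row per page, one column per expression
features = ["seo", "tf idf", "long tail"]
tfidf = [
    [0.5, 0.2, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 0.9],
    [0.5, 0.4, 0.0],
]
popular = mean_scores(tfidf, features)
original = nonzero_mean_scores(tfidf, features)
print(max(popular, key=popular.get))
print(max(original, key=original.get))
```

An expression present on every page ("seo") wins on the plain mean, whereas an expression that is strong on a single page ("long tail") wins once the zeros are ignored.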
Routes for downloading the files
#Download keywords File filename
@app.route('/downloadKWF/<path:filename>', methods=['GET', 'POST'] ) # this is a job for GET, not POST
def downloadKWF(filename):
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myUserName=dfSession.loc[ myUserId,'userName']
print("myUserName="+str(myUserName))
myScriptDirectory = get_script_directory()
myDirectory = myScriptDirectory+myconfig.UPLOAD_SUBDIRECTORY+"/"+myUserName
myFileName=filename
print("myFileName="+myFileName)
return send_file(myDirectory+"/"+myFileName,
mimetype='text/csv',
attachment_filename=myFileName,
as_attachment=True)
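Since `filename` comes from the URL through a `<path:...>` converter and is concatenated to the user's directory, it may be worth checking that the resulting path stays inside that directory. The sketch below shows one way to do so with `pathlib`; `resolve_download` is a hypothetical helper, not part of the application. Flask also provides `send_from_directory`, which is designed for this use case.

```python
import tempfile
from pathlib import Path


def resolve_download(user_directory: str, filename: str) -> Path:
    """Return the absolute path of filename, refusing anything outside user_directory."""
    base = Path(user_directory).resolve()
    target = (base / filename).resolve()
    if base not in target.parents:
        raise ValueError("requested file is outside the user's directory")
    return target


with tempfile.TemporaryDirectory() as user_dir:
    # a regular file name is accepted
    print(resolve_download(user_dir, "pop-12-34_seo_fr-1.csv").name)
    # a name that climbs out of the directory is rejected
    try:
        resolve_download(user_dir, "../other-user/secret.csv")
    except ValueError as error:
        print("rejected:", error)
```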
#Download Popular Keywords All Keywords
@app.route('/popAllCSV')
def popAllCSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="pop-"+myKeywordFileNameString+"-min-max.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download Popular Keywords 1 gram Keywords
@app.route('/pop1CSV')
def pop1CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="pop-"+myKeywordFileNameString+"-1.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download Popular Keywords 2 gram Keywords
@app.route('/pop2CSV')
def pop2CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="pop-"+myKeywordFileNameString+"-2.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download Popular Keywords 3 gram Keywords
@app.route('/pop3CSV')
def pop3CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="pop-"+myKeywordFileNameString+"-3.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download Popular Keywords 4 gram Keywords
@app.route('/pop4CSV')
def pop4CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="pop-"+myKeywordFileNameString+"-4.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download Popular Keywords 5 gram Keywords
@app.route('/pop5CSV')
def pop5CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="pop-"+myKeywordFileNameString+"-5.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download Popular Keywords 6 gram Keywords
@app.route('/pop6CSV')
def pop6CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="pop-"+myKeywordFileNameString+"-6.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
########################################################
# Original Keywords
########################################################
#Download Original Keywords All Keywords
@app.route('/oriAllCSV')
def oriAllCSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="ori-"+myKeywordFileNameString+"-min-max.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download original Keywords 1 gram Keywords
@app.route('/ori1CSV')
def ori1CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="ori-"+myKeywordFileNameString+"-1.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download original Keywords 2 gram Keywords
@app.route('/ori2CSV')
def ori2CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="ori-"+myKeywordFileNameString+"-2.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download original Keywords 3 gram Keywords
@app.route('/ori3CSV')
def ori3CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="ori-"+myKeywordFileNameString+"-3.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download original Keywords 4 gram Keywords
@app.route('/ori4CSV')
def ori4CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="ori-"+myKeywordFileNameString+"-4.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download original Keywords 5 gram Keywords
@app.route('/ori5CSV')
def ori5CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="ori-"+myKeywordFileNameString+"-5.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
#Download original Keywords 6 gram Keywords
@app.route('/ori6CSV')
def ori6CSV():
#get Session Variables
if current_user.is_authenticated:
myUserId = current_user.get_id()
print("myUserId="+str(myUserId))
myKeyword = dfSession.loc[ myUserId,'keyword']
myTLDLang = dfSession.loc[myUserId,'tldLang']
myKeywordId = dfSession.loc[ myUserId,'keywordId']
myKeywordUserId = dfSession.loc[ myUserId,'keywordUserId']
print("myUserId="+str(myUserId))
myKeywordFileNameString=strip_accents(myKeyword).lower()
myKeywordFileNameString = "-".join(myKeywordFileNameString.split(" "))
myKeywordFileNameString = myKeywordFileNameString+"_"+myTLDLang
myKeywordFileNameString = str(myKeywordId)+"-"+str(myKeywordUserId)+"_"+myKeywordFileNameString
print("myKeywordFileNameString = "+myKeywordFileNameString)
myFileName="ori-"+myKeywordFileNameString+"-6.csv"
return redirect(url_for('downloadKWF', filename=myFileName))
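The fourteen routes above differ only by the prefix (`pop` or `ori`) and the suffix (`min-max` or `1` to `6`) of the file name, so the name construction could live in a single helper. The sketch below is one possible factoring. `keyword_csv_name` and `strip_accents_sketch` are hypothetical (the application's own `strip_accents` is defined elsewhere and may behave differently), and the `"fr"` value used for `tldLang` is made up for the example.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Hypothetical stand-in for strip_accents: drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def keyword_csv_name(kind: str, suffix: str, keyword: str, tld_lang: str,
                     keyword_id: int, keyword_user_id: int) -> str:
    """Build '<kind>-<keywordId>-<keywordUserId>_<slug>_<tldLang>-<suffix>.csv'."""
    slug = "-".join(strip_accents_sketch(keyword).lower().split(" "))
    return f"{kind}-{keyword_id}-{keyword_user_id}_{slug}_{tld_lang}-{suffix}.csv"


name = keyword_csv_name("pop", "2", "Référencement naturel", "fr", 12, 34)
print(name)
```

Each route would then reduce to reading the session values, calling the helper with its own prefix and suffix, and redirecting to `downloadKWF`.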
Launching the TF-IDF Keywords Suggest application
if __name__ == '__main__':
app.run()
# app.run(debug=True, use_reloader=True)
tfidfkeywordsuggest.html template page
This template notably contains the JavaScript that receives the events sent by the server and displays the progress bar.
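The events consumed by that JavaScript are plain text frames: each one is a `data:` line followed by a blank line, which is what the `yield "data:" + str(myShow) + "\n\n"` statements of the `/progress` route produce. The sketch below isolates this framing in a hypothetical generator, without Flask; in the application the generator is wrapped in a `Response` with the `text/event-stream` MIME type.

```python
from typing import Iterable, Iterator


def sse_frames(progress_values: Iterable[int]) -> Iterator[str]:
    """Format each progress percentage as one Server-Sent Events frame."""
    for value in progress_values:
        # one "data:" line, then an empty line that ends the event
        yield f"data:{value}\n\n"


frames = list(sse_frames([50, 60, 100]))
print(frames)
```

On the browser side, an `EventSource` object delivers the text after `data:` for each frame, which the page can then use as the width of the progress bar.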
{% extends "bootstrap/base.html" %}
{% import "bootstrap/wtf.html" as wtf %}
{% block title %}
TF-IDF Keywords Suggest
{% endblock %}
{% block content %}
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="{{url_for('index')}}"><img src="{{url_for('static', filename='Oeil_Anakeyn.jpg')}}" align="left" width="30" /></a>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav navbar-right">
<!--
<li><a href="#">Settings</a></li>
<li><a href="#">Profile</a></li>
-->
<li><a href="{{ url_for('logout') }}">Log Out</a></li>
</ul>
</div>
</div>
</nav>
<!-- Left Menu -->
<div class="container-fluid">
<div class="row">
<div class="col-sm-3 col-md-2 sidebar">
<ul class="nav nav-sidebar">
<li class="active"><a href="#">TF-IDF Keyword Suggest <span class="sr-only">(current)</span></a></li>
<!--
<li><a href="#">Archives</a></li>
-->
</ul>
<ul class="nav nav-sidebar">
</ul>
<!--
<ul class="nav nav-sidebar">
{% if role == 0 %}
<li><a href="">Parameters</a></li>
{% endif %}
</ul>
-->
</div>
<div class="col-sm-9 col-sm-offset-3 col-md-10 col-md-offset-2 main">
<h1 class="page-header">Welcome, {{ name }} </h1>
<div class="row placeholders">
<img src="{{url_for('static', filename='Anakeyn_Rectangle.jpg')}}" />
{% if limitReached == 1 %}
<h3 class="limit-Reached">Limit Reached!</h3>
{% endif %}
{% if limitReached == 0 %}
<form class="form-signin" method="POST" enctype="multipart/form-data" action="{{ url_for('tfidfkeywordssuggest') }}">
{{ form.hidden_tag() }}
{{ wtf.form_field(form.keyword) }} {{ wtf.form_field(form.tldLang) }} <button class="btn btn-lg btn-primary btn-block" type="submit">Search</button>
</form>
{% endif %}
<h3 class="progress-title"> </h3>
<div class="progress" style="height: 22px; margin: 10px;">
<div class="progress-bar progress-bar-striped progress-bar-animated" role="progressbar" aria-valuenow="0"
aria-valuemin="0" aria-valuemax="100" style="width: 0%">
<span class="progress-bar-label">0%</span>
</div>
</div>
<div id="myResults" style="visibility:hidden;" align="center">
<table cellpadding="10" cellspacing="10">
<tr><td><a href="/popAllCSV">Download the {{ MaxResults }} most popular expressions among the crawled URLs</a></td><td width="10%"> </td><td><a href="/oriAllCSV">Download the {{ MaxResults }} most original expressions among the crawled URLs</a></td></tr>
<tr><td><a href="/pop1CSV">Download the {{ MaxResults }} most popular 1-word expressions among the crawled URLs</a></td><td width="10%"> </td><td><a href="/ori1CSV">Download the {{ MaxResults }} most original 1-word expressions among the crawled URLs</a></td></tr>
<tr><td><a href="/pop2CSV">Download the {{ MaxResults }} most popular 2-word expressions among the crawled URLs</a></td><td width="10%"> </td><td><a href="/ori2CSV">Download the {{ MaxResults }} most original 2-word expressions among the crawled URLs</a></td></tr>
<tr><td><a href="/pop3CSV">Download the {{ MaxResults }} most popular 3-word expressions among the crawled URLs</a></td><td width="10%"> </td><td><a href="/ori3CSV">Download the {{ MaxResults }} most original 3-word expressions among the crawled URLs</a></td></tr>
<tr><td><a href="/pop4CSV">Download the {{ MaxResults }} most popular 4-word expressions among the crawled URLs</a></td><td width="10%"> </td><td><a href="/ori4CSV">Download the {{ MaxResults }} most original 4-word expressions among the crawled URLs</a></td></tr>
<tr><td><a href="/pop5CSV">Download the {{ MaxResults }} most popular 5-word expressions among the crawled URLs</a></td><td width="10%"> </td><td><a href="/ori5CSV">Download the {{ MaxResults }} most original 5-word expressions among the crawled URLs</a></td></tr>
<tr><td><a href="/pop6CSV">Download the {{ MaxResults }} most popular 6-word expressions among the crawled URLs</a></td><td width="10%"> </td><td><a href="/ori6CSV">Download the {{ MaxResults }} most original 6-word expressions among the crawled URLs</a></td></tr>
</table>
</div>
</div>
</div>
</div>
</div>
{% endblock %}
{% block styles %}
{{super()}}
<link rel="stylesheet" href="{{url_for('.static', filename='keywordssuggest.css')}}">
<script>
var source = new EventSource("/progress");
source.onmessage = function(event) {
$('.progress-bar').css('width', event.data+'%').attr('aria-valuenow', event.data);
$('.progress-bar-label').text(event.data+'%');
if(event.data ==-1 ){
$('.progress-title').text('You reached your day search limit - Please come back tomorrow!');
source.close()
}
if(event.data >0 && event.data < 50 ){
$('.progress-title').text('Search in Google, please be patient!');
}
if(event.data >=50 && event.data < 60 ){
$('.progress-title').text('Select Urls to crawl, please be patient!');
}
if(event.data >=60 && event.data < 70 ){
$('.progress-title').text('Crawl Urls, please be patient!');
}
if(event.data >=70 && event.data < 80 ){
$('.progress-title').text('Get content from Urls, please be patient!');
}
if(event.data >=80 && event.data < 90 ){
$('.progress-title').text('Save content from Urls, please be patient!');
}
if(event.data >=90 && event.data < 100 ){
$('.progress-title').text('Create TF-IDF Keywords Files, please be patient!');
}
if(event.data >= 100){
$('.progress-title').text('Process Completed - Download your TF-IDF Keywords files');
document.getElementById("myResults").style.visibility = "visible";
source.close()
}
}
</script>
{% endblock %}
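The script in this template subscribes to `/progress` with `EventSource` and reads a percentage from each `event.data`. To make the exchange concrete, here is a minimal sketch of what the server side of such a stream has to produce. It is an illustration only, not the application's actual `/progress` route: `sse_message` and `progress_stream` are hypothetical names, and the list of percentages is invented.

```python
def sse_message(value):
    # One server-sent event: a "data:" line terminated by a blank line.
    return "data:" + str(value) + "\n\n"


def progress_stream(percentages):
    # Yield one event per progress value and stop once the browser would
    # close the connection (-1 = daily limit reached, 100 = completed).
    for pct in percentages:
        yield sse_message(pct)
        if pct == -1 or pct >= 100:
            break


# In a Flask view, a generator like this one would typically be returned as
#   Response(progress_stream(...), mimetype="text/event-stream")
# (shown as a comment so that the sketch runs without Flask installed).
for message in progress_stream([10, 55, 100]):
    print(repr(message))
```

Each `data:` payload arrives in the browser as `event.data`, which is the value the script compares against the 50, 60, 70, 80, 90 and 100 thresholds to choose the progress title.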
Thank you for your attention.
Suggestions and questions about the Anakeyn TF-IDF Keywords Suggest tool are welcome in the comments.
See you soon,
Pierre