¸üÐÂʱ¼ä:2022Äê03ÔÂ07ÈÕ18ʱ36·Ö À´Ô´:ÀÖÓãµç¾º ä¯ÀÀ´ÎÊý:
Spark RDD ±à³ÌµÄ³ÌÐòÈë¿Ú¶ÔÏóÊÇSparkContext¶ÔÏó(²»ÂÛºÎÖÖ±à³ÌÓïÑÔ)¡£Ö»Óй¹½¨³öSparkContext, »ùÓÚËü²ÅÄÜÖ´ÐкóÐøµÄAPIµ÷ÓúͼÆËã ¡£±¾ÖÊÉÏ, SparkContext¶Ô±à³ÌÀ´Ëµ, Ö÷Òª¹¦ÄܾÍÊÇ´´½¨µÚÒ»¸öRDD³öÀ´¡£
RDDµÄ´´½¨¿ÉÒÔͨ¹ý2ÖÖ·½Ê½£¬ ͨ¹ý²¢Ðл¯¼¯ºÏ´´½¨( ±¾µØ¶ÔÏóת·Ö²¼Ê½RDD )ºÍͨ¹ý¶ÁÈ¡ÍⲿÊý¾ÝÔ´( ¶ÁÈ¡Îļþ)´´½¨¡£

²¢Ðл¯´´½¨ÊÇÖ¸½«±¾µØ¼¯ºÏתÏò·Ö²¼Ê½RDD,ÕâÒ»²½µÄ´´½¨ÊÇ·Ö²¼Ê½µÄ¿ª¶Ë£¬½«±¾µØ¼¯ºÏת»¯Îª·Ö²¼Ê½¼¯ºÏ¡£
APIÈçÏÂ
rdd=sparkcontext.parallelize(²ÎÊý1£¬²ÎÊý2) #²ÎÊý1¼¯ºÏ¶ÔÏó¼´¿É£¬±ÈÈçlist #²ÎÊý2·ÖÇøÊýÍêÕû´úÂ룺
# coding: utf8
from pyspark import SparkConf, SparkContext
if __name__ = '__main__':
# e.¹¹½¨SparkÖ´Ðл·¾³
conf = SparkConf().setAppName("create rdd").\
setMaster("local[*]"]
sc = SparkContext(conf = conf)
# sc¶ÔÏóµÄparallelize·½·¨£¬ ¿ÉÒÔ½«±¾µØ¼¯ºÏת»»³ÉRDD·µ»Ø¸øÄã
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
rdd = sc.parallelize(data, numSlices = 3)
print(rdd.collect())
»ñÈ¡RDD·ÖÇøÊý·µ»ØÖµÊÇIntÊý×Ö£ºgetNumPartitions API
Ó÷¨rdd.getNumPartitions()
textFile API
Õâ¸öAPI¿ÉÒÔ¶ÁÈ¡±¾µØÊý¾Ý£¬Ò²¿ÉÒÔ¶ÁÈ¡hdfsÊý¾Ý
ʹÓ÷½·¨£º
sparkcontext.textFile(²ÎÊý1£¬²ÎÊý2) #²ÎÊý1£¬±ØÌÎļþ·¾¶Ö§³Ö±¾µØÎļþÖ§³ÖHDFSÒ²Ö§³ÖһЩ±ÈÈçS3ÐÒé #²ÎÊý2£¬¿ÉÑ¡£¬±íʾ×îС·ÖÇøÊýÁ¿¡£ #×¢Ò⣺²ÎÊý2»°ÓïȨ²»×㣬sparkÓÐ×Ô¼ºµÄÅжϣ¬ÔÚËüÔÊÐíµÄ·¶Î§ÄÚ£¬²ÎÊý2ÓÐЧ¹û£¬³¬³ösparkÔÊÐíµÄ·¶Î§£¬²ÎÊý2ʧЧÍêÕû´úÂë
1f __nane__ = '__main__:
# B.¹¹½¨SparkÖ´Ðл·¾³
conf = SparkConf().setAppNane("create rdd").\
setMaster("local[*]")
sc = SparkContext(conf=conf)
# textFile API ¶ÁÈ¡Îļþ
rdd = sc.textFile(".…/data/words.txt", 1000)
print(rdd.getNumPartitions())
rdd2 = sc.textFile("hdfs://nodel:8020/input/words.txt", 1888)
#×îС·ÖÇøÊý¸øÁË1060£¬µ«ÊÇʵ¼Ê¾Í¿ªÁË85¸ö£¬ sparkûÓÐÀí»áÄãÒªÇó×îÉÙ1008µÄÒªÇ󣬶øÊǾ¡ÊǶ࿪¡£
print(rdd2.getNumPartitions())
print(rdd2.collect())
×¢Ò⣺textFile³ý·ÇÓкÜÃ÷È·µÄÖ¸ÏòÐÔ£¬Ò»°ãÇé¿öÏ£¬ÎÒÃDz»ÊÇÖ¸·ÖÇø²ÎÊý¡£
¶ÁÈ¡ÎļþµÄAPI£¬ÓиöСÎļþ¶ÁȡרÓó¡¾°£ºÊʺ϶Áȡһ¶ÑСÎļþ
Ó÷¨£º
sparkcontext.wholeTextFiles(²ÎÊý1,²ÎÊý2) #²ÎÊý1£¬±ØÌÎļþ·¾¶Ö§³Ö±¾µØÎļþÖ§³ÖHDFSÒ²Ö§³ÖһЩ±ÈÈçS3ÐÒé #²ÎÊý2£¬¿ÉÑ¡£¬±íʾ×îС·ÖÇøÊýÁ¿¡£ #×¢Ò⣺²ÎÊý2»°ÓïȨ²»×㣬Õâ¸öAPI·ÖÇøÊýÁ¿×î¶àÒ²Ö»ÄÜ¿ªµ½ÎļþÊýÁ¿
Õâ¸öAPIÆ«ÏòÓÚÉÙÁ¿·ÖÇø¶ÁÈ¡Êý¾Ý,ÒòΪÕâ¸öAPI±íÃ÷ÁË×Ô¼ºÊÇСÎļþ¶ÁȡרÓã¬ÄÇôÎļþµÄÊý¾ÝºÜС¡£·ÖÇøºÜ¶à£¬µ¼ÖÂshuffleµÄ¼¸Âʸü¸ß.ËùÒÔ¾¡Á¿ÉÙ·ÖÇø¶ÁÈ¡Êý¾Ý¡£
RDDÔÚSparkÖÐÊÇÈçºÎÔËÐеģ¿
DataFrameÊÇʲôÒâ˼?ÓëRDDÏà±ÈÓÐÄÄЩÓŵ㣿
RDDΪʲôҪ½øÐÐÊý¾Ý³Ö¾Ã»¯£¿³Ö¾Ã»¯²Ù×÷²½Öè
Á½ÖÖRDDµÄÒÀÀµ¹ØÏµ½éÉÜ
ÀÖÓãµç¾ºpython+´óÊý¾Ý¿ª·¢Åàѵ¿Î³Ì
±±¾©Ð£Çø